Abstract

We propose an extensive analysis of the behavior of majority votes in binary classification. 
In particular, we introduce a risk bound for majority votes, called the C-bound, that takes 
into account the average quality of the voters and their average disagreement. We also 
propose an extensive PAC-Bayesian analysis that shows how the C-bound can be estimated 
from various observations contained in the training data. The analysis intends to be
self-contained and can be used as introductory material to PAC-Bayesian statistical learning
theory. It starts from a general PAC-Bayesian perspective and ends with uncommon
PAC-Bayesian bounds. Some of these bounds contain no Kullback-Leibler divergence and others
allow kernel functions to be used as voters (via the sample compression setting). Finally, 
out of the analysis, we propose the MinCq learning algorithm that basically minimizes the 
C-bound. MinCq reduces to a simple quadratic program. Aside from being theoretically 
grounded, MinCq achieves state-of-the-art performance, as shown in our extensive empirical 
comparison with both AdaBoost and the Support Vector Machine. 

Keywords: majority vote, ensemble methods, learning theory, PAC-Bayesian theory, sample compression


1. Previous Work and Implementation 

This paper can be considered as an extended version of Lacasse et al. (2006) and Laviolette
et al. (2011), and also contains ideas from Laviolette and Marchand (2005, 2007) and
Germain et al. (2009, 2011). We unify this previous work, revise the mathematical approach,
add new results and extend empirical experiments.

The source code to compute the various PAC-Bayesian bounds presented in this paper 
and the implementation of the MinCq learning algorithm is available at: 

http://graal.ift.ulaval.ca/majorityvote/
2. Introduction

In binary classification, many state-of-the-art algorithms output prediction functions that 
can be seen as a majority vote of “simple” classifiers. Firstly, ensemble methods such 
as Bagging (Breiman, 1996), Boosting (Schapire and Singer, 1999) and Random Forests 
(Breiman, 2001) are well-known examples of learning algorithms that output majority votes. 
Secondly, majority votes are also central in the Bayesian approach (see Gelman et al., 2004,
for an introductory text); in this setting, the majority vote is generally called the Bayes 
Classifier. Thirdly, it is interesting to point out that classifiers produced by kernel methods, 
such as the Support Vector Machine (SVM) (Cortes and Vapnik, 1995), can also be viewed 
as majority votes. Indeed, to classify an example $x$, the SVM classifier computes

$$\mathrm{sgn}\left( \sum_{i=1}^{|S|} \alpha_i\, y_i\, k(x_i, x) \right), \tag{1}$$

where $k(\cdot,\cdot)$ is a kernel function, and the input-output pairs $(x_i, y_i)$ represent the examples
from the training set $S$. Thus, one can interpret each $y_i\, k(x_i, \cdot)$ as a voter that chooses (with
confidence level $|k(x_i, x)|$) between two alternatives (“positive” or “negative”), and $\alpha_i$ as the
respective weight of this voter in the majority vote. Then, if the total confidence-multiplied
weight of each voter that votes positive is greater than the total confidence-multiplied weight
of each voter that votes negative, the classifier outputs a $+1$ label (and a $-1$ label in the
opposite case). Similarly, each neuron of the last layer of an artificial neural network can
be interpreted as a majority vote, since it outputs a real value given by $K\left(\sum_i w_i\, g_i(x)\right)$ for
some activation function $K$.¹
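The reading of a kernel classifier as a weighted vote can be sketched in a few lines. This is only an illustration under stated assumptions: the Gaussian kernel, the two toy examples and the weights `alphas` are hypothetical choices made for the sketch (they are not produced by an SVM solver), and no bias term is used, as in the expression discussed above.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel, chosen here only as an example of k(., .)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def kernel_majority_vote(examples, alphas, x, kernel=rbf_kernel):
    """Return the sign of sum_i alpha_i * y_i * k(x_i, x), with sign(0) = 0."""
    total = 0.0
    for (x_i, y_i), alpha_i in zip(examples, alphas):
        # The voter y_i * k(x_i, .) votes for the sign of y_i * k(x_i, x)
        # with confidence |k(x_i, x)|; alpha_i is the weight of this voter.
        total += alpha_i * y_i * kernel(x_i, x)
    return (total > 0) - (total < 0)


# Hypothetical toy training set and weights (assumptions for illustration).
examples = [((0.0, 0.0), 1), ((2.0, 2.0), -1)]
alphas = [1.0, 1.0]

print(kernel_majority_vote(examples, alphas, (0.1, 0.2)))  # near the positive example
print(kernel_majority_vote(examples, alphas, (1.9, 2.1)))  # near the negative example
```

Each training example contributes one voter, so the set of voters depends on the data; this is the situation that the sample compression setting mentioned in the abstract is meant to handle.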

In practice, it is well known that the classifier output by each of these learning algorithms 
performs much better than any of its voters individually. Indeed, voting can dramatically 
improve performance when the “community” of classifiers tends to compensate for individual 
errors. In particular, this phenomenon explains the success of Boosting algorithms (e.g.,
Schapire et al., 1998). The first aim of this paper is to explore how bounds on the generalized
risk of the majority vote are not only able to theoretically justify learning algorithms but also 
to detect when the voted combination provably outperforms the average of its voters. We 
expect that this study of the behavior of a majority vote should improve the understanding 
of existing learning algorithms and even lead to new ones. We indeed present a learning 
algorithm based on these ideas at the end of the paper. 

The PAC-Bayesian theory is a well-suited approach to analyze majority votes. Initiated
by McAllester (1999), this theory aims to provide Probably Approximately Correct guarantees
(PAC guarantees) to “Bayesian-like” learning algorithms. Within this approach, one
considers a prior² distribution P over a space of classifiers that characterizes one's prior belief
about good classifiers (before the observation of the data) and a posterior distribution Q
(over the same space of classifiers) that takes into account the additional information provided
by the training data. The classical PAC-Bayesian approach indirectly bounds the risk

1. In this case, each voter $g_i$ has incoming weights which are also learned (often by back propagation of
errors) together with the weights $w_i$. The analysis presented in this paper considers fixed voters. Thus,
the PAC-Bayesian theory for artificial neural networks remains to be done. Note however that the recent 
work by McAllester (2013) provides a first step in that direction. 

2. Priors have been used for many years in statistics. The priors in this paper have only indirect links with 
the Bayesian priors. We nevertheless use this language, since it comes from previous work. 
of a Q-weighted majority vote by bounding the risk of an associated (stochastic) classifier,
called the Gibbs classifier. A remarkable result, known as the “PAC-Bayesian Theorem”,
provides a risk bound for the “true” risk of the Gibbs classifier, by considering the empirical
risk of this Gibbs classifier on the training data and the Kullback-Leibler divergence
between a posterior distribution Q and a prior distribution P. It is well known (Langford
and Shawe-Taylor, 2002; McAllester, 2003b; Germain et al., 2009) that the risk of the
(deterministic) majority vote classifier is upper-bounded by twice the risk of the associated
(stochastic) Gibbs classifier. Unfortunately, and especially if the involved voters are
weak, this indirect bound on the majority vote classifier is far from being tight, even if the 
PAC-Bayesian bound itself generally gives a tight bound on the risk of the Gibbs classifier. 
In practice, as stated before, the “community” of classifiers can act in such a way as to 
compensate for individual errors. When such compensation occurs, the risk of the majority 
vote is then much lower than the Gibbs risk itself and, a fortiori, much lower than twice the 
Gibbs risk. By limiting the analysis to Gibbs risk only, the commonly used PAC-Bayesian 
framework is unable to evaluate whether or not this compensation occurs. Consequently, 
this framework cannot help in producing highly accurate voted combinations of classifiers 
when these classifiers are individually weak. 

In this paper, we tackle this problem by studying the margin of the majority vote as 
a random variable. The first and second moments of this random variable are respectively 
linked with the risk of the Gibbs classifier and the expected disagreement between the voters 
of the majority vote. As we will show, the well-known factor of two used to bound the risk 
of the majority vote is recovered by applying Markov’s inequality to the first moment of the 
margin. Based on this observation, we show that a tighter bound, that we call the C-bound, 
is obtained by considering the first two moments of the margin, together with Chebyshev’s 
inequality. 

Section 4 presents, in a more detailed way, the work on the C-bound originally presented 
in Lacasse et al. (2006). We then present both theoretical and empirical studies that show 
that the C-bound is an accurate indicator of the risk of the majority vote. We also show that 
the C-bound can be smaller than the risk of the Gibbs classifier and can even be arbitrarily 
close to zero even if the risk of the Gibbs classifier is close to 1/2. This indicates that 
the C-bound can effectively capture the compensation of the individual errors made by the 
voters. 

We then develop PAC-Bayesian guarantees on the C-bound in order to obtain an upper 
bound on the risk of the majority vote based on empirical observations. Section 5 presents 
a general approach of the PAC-Bayesian theory by which we recover the most commonly 
used forms of the bounds of McAllester (1999, 2003a) and Langford and Seeger (2001); 
Seeger (2002); Langford (2005). Thereafter, we extend the theory to obtain upper bounds 
on the C-bound in two different ways. The first method is to separately bound the risk 
of the Gibbs classifier and the expected disagreement—which are the two fundamental 
ingredients that are present in the C-bound. Since the expected disagreement does not 
rely on labels, this strategy is well-suited for the semi-supervised learning framework. The 
second method directly bounds the C-bound and empirically improves the achievable bounds 
in the supervised learning framework. 
Sections 6 and 7 bring together relatively new PAC-Bayesian ideas that allow us, for one
part, to derive a PAC-Bayesian bound that does not rely on the Kullback-Leibler divergence
between the prior and posterior distributions (as in Catoni, 2007; Germain et al., 2011;
Laviolette et al., 2011) and, for the other part, to extend the bound to the case where the
voters are defined using elements of the training data, e.g., voters defined by kernel functions
$y_i\, k(x_i, \cdot)$. This second approach is based on the sample compression theory (Floyd and
Warmuth, 1995; Laviolette and Marchand, 2007; Germain et al., 2011). In PAC-Bayesian
theory, the sample compression approach is a priori problematic, since a PAC-Bayesian 
bound makes use of a prior distribution on the set of all voters that has to be defined before 
observing the data. If the voters themselves are defined using a part of the data, there is 
an apparent contradiction that has to be overcome. 

Based on the foregoing, a learning algorithm, that we call MinCq, is presented in Section 8.
The algorithm basically minimizes the C-bound, but in a particular way that is, inter
alia, justified by the PAC-Bayesian analysis of Sections 6 and 7. This algorithm was originally
presented in Laviolette et al. (2011). Given a set of voters (either classifiers or kernel
functions), MinCq builds a majority vote classifier by finding the posterior distribution Q 
on the set of voters that minimizes the C-bound. Hence, MinCq takes into account not 
only the overall quality of the voters, but also their respective disagreements. In this way, 
MinCq builds a “community” of voters that can compensate for their individual errors. 
Even though the C-bound consists of a relatively complex quotient, the MinCq learning 
algorithm reduces to a simple quadratic program. Moreover, extensive empirical experiments
confirm that MinCq is very competitive when compared with AdaBoost (Schapire
and Singer, 1999) and the Support Vector Machine (Cortes and Vapnik, 1995). 

In Section 9, we conclude by pointing out recent work that uses the PAC-Bayesian 
theory to tackle more sophisticated machine learning problems. 


3. Basic Definitions 

We consider classification problems where the input space $\mathcal{X}$ is an arbitrary set and the
output space is a discrete set denoted $\mathcal{Y}$. An example $(x, y)$ is an input-output pair where
$x \in \mathcal{X}$ and $y \in \mathcal{Y}$. A voter is a function $\mathcal{X} \to \overline{\mathcal{Y}}$ for some output space $\overline{\mathcal{Y}}$ related to $\mathcal{Y}$.
Unless otherwise specified, we consider the binary classification problem where $\mathcal{Y} = \{-1, 1\}$
and then we either consider $\overline{\mathcal{Y}}$ as $\mathcal{Y}$ itself, or its convex hull $[-1, +1]$. In this paper, we
also use the following convention: $f$ denotes a real-valued voter (i.e., $\overline{\mathcal{Y}} = [-1, 1]$), and $h$
denotes a binary-valued voter (i.e., $\overline{\mathcal{Y}} = \{-1, 1\}$). Note that this notion of voters is quite
general, since any uniformly bounded real-valued set of functions can be viewed as a set of
voters when properly normalized.

We consider learning algorithms that construct majority votes based on a (finite) set $\mathcal{H}$
of voters. Given any $x \in \mathcal{X}$, the output $B_Q(x)$ of a $Q$-weighted majority vote classifier $B_Q$
(sometimes called the Bayes classifier) is given by

$$B_Q(x) \overset{\text{def}}{=} \mathrm{sgn}\left[ \mathbf{E}_{f \sim Q} f(x) \right], \tag{2}$$

where $\mathrm{sgn}(a) = 1$ if $a > 0$, $\mathrm{sgn}(a) = -1$ if $a < 0$, and $\mathrm{sgn}(0) = 0$.
Thus, in case of a tie in the majority vote - i.e., $\mathbf{E}_{f \sim Q} f(x) = 0$ - we consider that the
majority vote classifier abstains - i.e., $B_Q(x) = 0$. There are other possible ways to handle
this particular case. In this paper, we choose to define $\mathrm{sgn}(0) = 0$ because it simplifies the
forthcoming analysis.
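The definition of the $Q$-weighted majority vote, including the abstention convention, can be sketched as follows. The three decision stumps and the uniform weights are hypothetical, chosen only to make the example concrete.

```python
def sgn(a):
    """Sign function with sgn(0) = 0, the convention chosen in this paper."""
    return (a > 0) - (a < 0)


def majority_vote(voters, weights, x):
    """Output B_Q(x) of the Q-weighted majority vote.

    voters  -- list of functions mapping an input to a value in [-1, 1]
    weights -- Q-weights of the voters (nonnegative, summing to 1)
    """
    weighted_sum = sum(q * f(x) for f, q in zip(voters, weights))
    return sgn(weighted_sum)


# Three hypothetical decision stumps on a real-valued input.
voters = [
    lambda x: 1 if x > 0.0 else -1,
    lambda x: 1 if x > 1.0 else -1,
    lambda x: 1 if x > 2.0 else -1,
]
uniform = [1 / 3, 1 / 3, 1 / 3]

print(majority_vote(voters, uniform, 1.5))          # two of the three stumps vote +1
print(majority_vote(voters[:2], [0.5, 0.5], 0.5))   # the two stumps disagree: abstention
```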

We adopt the PAC setting where each example $(x, y)$ is drawn i.i.d. according to a
fixed, but unknown, probability distribution $D$ on $\mathcal{X} \times \mathcal{Y}$. The training set of $m$ examples
is denoted by $S = \left( (x_1, y_1), \ldots, (x_m, y_m) \right) \sim D^m$. Throughout the paper, $D'$ generically
represents either the true (and unknown) distribution $D$, or its empirical counterpart $U_S$
(i.e., the uniform distribution over the training set $S$). Moreover, for notational simplicity,
we often replace $U_S$ by $S$.

In order to quantify the accuracy of a voter, we use a loss function $\mathcal{L} : \overline{\mathcal{Y}} \times \mathcal{Y} \to [0, 1]$.
The PAC-Bayesian theory traditionally considers majority votes of binary voters of the form
$h : \mathcal{X} \to \{-1, 1\}$, and the zero-one loss $\mathcal{L}_{01}(h(x), y) \overset{\text{def}}{=} I(h(x) \neq y)$, where $I(a) = 1$ if
predicate $a$ is true and 0 otherwise.

The extension of the zero-one loss to real-valued voters (of the form $f : \mathcal{X} \to [-1, 1]$) is
given by the following definition.

Definition 1 In the (more general) case where voters are functions $f : \mathcal{X} \to [-1, 1]$, the
zero-one loss $\mathcal{L}_{01}$ is defined by

$$\mathcal{L}_{01}(f(x), y) \overset{\text{def}}{=} I\left( y \cdot f(x) \le 0 \right).$$

Hence, a voter abstention - i.e., when $f(x)$ outputs exactly 0 - results in a loss of 1. Clearly,
other choices are possible for this particular case.³

In this paper, we also consider the linear loss $\mathcal{L}_\ell$ defined as follows.

Definition 2 Given a voter $f : \mathcal{X} \to [-1, 1]$, the linear loss $\mathcal{L}_\ell$ is defined by

$$\mathcal{L}_\ell(f(x), y) \overset{\text{def}}{=} \frac{1}{2}\left( 1 - y \cdot f(x) \right).$$

Note that the linear loss is equal to the zero-one loss when the output space is binary. That
is, for any $(h(x), y) \in \{-1, 1\}^2$, we always have

$$\mathcal{L}_\ell(h(x), y) = \mathcal{L}_{01}(h(x), y), \tag{3}$$

because $\mathcal{L}_\ell(h(x), y) = 1$ if $h(x) \neq y$, and $\mathcal{L}_\ell(h(x), y) = 0$ if $h(x) = y$. Hence, we generalize
all definitions involving classifiers to voters using the equality of Equation (3) as an inspiration.
Figure 1 illustrates the difference between the zero-one loss and the linear loss for
real-valued voters. Remember that in the case $y \cdot f(x) = 0$, the zero-one loss is 1 (see Definition 1).
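The two losses of Definitions 1 and 2 are simple enough to write down directly; the short sketch below checks that they coincide on binary outputs, as stated by Equation (3), and differ on real-valued outputs. The function names are ours.

```python
def zero_one_loss(output, y):
    """Zero-one loss of Definition 1: 1 when y * output <= 0, and 0 otherwise."""
    return 1.0 if y * output <= 0 else 0.0


def linear_loss(output, y):
    """Linear loss of Definition 2: (1 - y * output) / 2."""
    return 0.5 * (1.0 - y * output)


# On binary outputs, the two losses coincide (Equation 3).
for h in (-1, 1):
    for y in (-1, 1):
        assert zero_one_loss(h, y) == linear_loss(h, y)

# On real-valued outputs they differ: a weakly confident correct vote has
# zero-one loss 0 but linear loss close to 1/2, and an abstention (output 0)
# has zero-one loss 1 but linear loss exactly 1/2.
print(zero_one_loss(0.2, 1), linear_loss(0.2, 1))
print(zero_one_loss(0.0, 1), linear_loss(0.0, 1))
```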

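Definition 3 Given a loss function $\mathcal{L}$ and a voter $f$, the expected loss $\mathbf{E}^{\mathcal{L}}_{D'}(f)$ of $f$ relative
to distribution $D'$ is defined as

$$\mathbf{E}^{\mathcal{L}}_{D'}(f) \overset{\text{def}}{=} \mathbf{E}_{(x,y) \sim D'}\, \mathcal{L}(f(x), y).$$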

3. As an example, when $f(x)$ outputs 0, the loss may be 1/2. However, we choose for this unlikely event
the worst loss value - i.e., $\mathcal{L}_{01}(0, y) = 1$ - because it simplifies the majority vote analysis.
Figure 1: The zero-one loss $\mathcal{L}_{01}$ and the linear loss $\mathcal{L}_\ell$ as a function of $y \cdot f(x)$.


In particular, the empirical expected loss on a training set $S$ is given by

$$\mathbf{E}^{\mathcal{L}}_{S}(f) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(f(x_i), y_i).$$


We therefore define the risk of the majority vote $R_{D'}(B_Q)$ as follows.

Definition 4 For any probability distribution $Q$ on a set of voters, the Bayes risk $R_{D'}(B_Q)$,
also called risk of the majority vote, is defined as the expected zero-one loss of the majority
vote classifier $B_Q$ relative to $D'$. Hence,

$$R_{D'}(B_Q) \overset{\text{def}}{=} \mathbf{E}^{\mathcal{L}_{01}}_{D'}(B_Q) = \mathbf{E}_{(x,y) \sim D'}\, I\left( B_Q(x) \neq y \right) = \mathbf{E}_{(x,y) \sim D'}\, I\left( \mathbf{E}_{f \sim Q}\, y \cdot f(x) \le 0 \right).$$

Remember from the definition of $B_Q$ (Equation 2) that the majority vote classifier abstains
in the case of a tie on an example $(x, y)$. Therefore, the above Definition 4 implies that the
Bayes risk is 1 in this case, as $R_{\langle (x,y) \rangle}(B_Q) = \mathcal{L}_{01}(0, y) = 1$. In practice, a tie in the vote is a
rare event, especially if there are many voters.

The output of the deterministic majority vote classifier $B_Q$ is closely related to the
output of a stochastic classifier called the Gibbs classifier. To classify an input example $x$,
the Gibbs classifier $G_Q$ randomly chooses a voter $f$ according to $Q$ and returns $f(x)$. Note
the stochasticity of the Gibbs classifier: it can output different values when given the same
input $x$ twice. We will see later how the link between $B_Q$ and $G_Q$ is used in the
PAC-Bayesian theory.

In the case of binary voters, the Gibbs risk corresponds to the probability that $G_Q$
misclassifies an example of distribution $D'$. Hence,

$$R_{D'}(G_Q) = \Pr_{(x,y) \sim D',\ h \sim Q}\left( h(x) \neq y \right) = \mathbf{E}_{h \sim Q}\, \mathbf{E}^{\mathcal{L}_{01}}_{D'}(h) = \mathbf{E}_{(x,y) \sim D'}\, \mathbf{E}_{h \sim Q}\, I\left( h(x) \neq y \right).$$

In order to handle real-valued voters, we generalize the Gibbs risk as follows.


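Definition 5 For any probability distribution $Q$ on a set of voters, the Gibbs risk $R_{D'}(G_Q)$
is defined as the expected linear loss of the Gibbs classifier $G_Q$ relative to $D'$. Hence,

$$R_{D'}(G_Q) \overset{\text{def}}{=} \mathbf{E}_{f \sim Q}\, \mathbf{E}^{\mathcal{L}_\ell}_{D'}(f) = \frac{1}{2}\left( 1 - \mathbf{E}_{(x,y) \sim D'}\, \mathbf{E}_{f \sim Q}\, y \cdot f(x) \right).$$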
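On a finite sample, both the risk of the majority vote (Definition 4) and the Gibbs risk (Definition 5) are direct averages. The following sketch computes them from a matrix of voter outputs; the data layout (`outputs[i][j]` for voter `i` on example `j`) and the toy sample are assumptions made for the illustration.

```python
def gibbs_risk(outputs, labels, weights):
    """Empirical Gibbs risk: Q-average of each voter's mean linear loss.

    outputs[i][j] -- output in [-1, 1] of voter i on example j
    labels[j]     -- label in {-1, +1} of example j
    weights[i]    -- Q-weight of voter i (nonnegative, summing to 1)
    """
    m = len(labels)
    risk = 0.0
    for q, voter_outputs in zip(weights, outputs):
        mean_linear_loss = sum(0.5 * (1 - y * o) for o, y in zip(voter_outputs, labels)) / m
        risk += q * mean_linear_loss
    return risk


def bayes_risk(outputs, labels, weights):
    """Empirical risk of the majority vote; a tie counts as an error."""
    m = len(labels)
    errors = 0
    for j, y in enumerate(labels):
        weighted_vote = sum(q * voter_outputs[j] for q, voter_outputs in zip(weights, outputs))
        if y * weighted_vote <= 0:
            errors += 1
    return errors / m


# Hypothetical sample: three binary voters, each wrong on a different third of the data.
labels = [1, 1, 1, -1, -1, -1]
outputs = [
    [-1, 1, 1, 1, -1, -1],   # wrong on examples 0 and 3
    [1, -1, 1, -1, 1, -1],   # wrong on examples 1 and 4
    [1, 1, -1, -1, -1, 1],   # wrong on examples 2 and 5
]
weights = [1 / 3, 1 / 3, 1 / 3]

print(gibbs_risk(outputs, labels, weights))   # each voter errs on a third of the examples
print(bayes_risk(outputs, labels, weights))   # the errors never overlap, so the vote is always right
```

In this toy sample, the voters compensate for one another's errors: the majority vote makes no mistake although every voter is wrong a third of the time.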
Remark 6 It is well known in the PAC-Bayesian literature (e.g., Langford and Shawe-Taylor,
2002; McAllester, 2003b; Germain et al., 2009) that the Bayes risk $R_{D'}(B_Q)$ is
bounded by twice the Gibbs risk $R_{D'}(G_Q)$. This statement extends to our more general
definition of the Gibbs risk (Definition 5).

Proof Let $(x, y) \in \mathcal{X} \times \{-1, 1\}$ be any example. We claim that

$$R_{\langle (x,y) \rangle}(B_Q) \le 2\, R_{\langle (x,y) \rangle}(G_Q). \tag{4}$$

Notice that $R_{\langle (x,y) \rangle}(B_Q)$ is either 0 or 1, depending on whether or not $B_Q$ errs on $(x, y)$.
In the case where $R_{\langle (x,y) \rangle}(B_Q) = 0$, Equation (4) is trivially true. If $R_{\langle (x,y) \rangle}(B_Q) = 1$, we
know by the last equality of Definition 4 that $\mathbf{E}_{f \sim Q}\, y \cdot f(x) \le 0$. Therefore, Definition 5 gives

$$2 \cdot R_{\langle (x,y) \rangle}(G_Q) = 2 \cdot \frac{1}{2}\left( 1 - \mathbf{E}_{f \sim Q}\, y \cdot f(x) \right) \ge 1 = R_{\langle (x,y) \rangle}(B_Q),$$

which proves the claim.

Now, by taking the expectation according to $(x, y) \sim D'$ on each side of Equation (4),
we obtain

$$R_{D'}(B_Q) = \mathbf{E}_{(x,y) \sim D'}\, R_{\langle (x,y) \rangle}(B_Q) \le \mathbf{E}_{(x,y) \sim D'}\, 2\, R_{\langle (x,y) \rangle}(G_Q) = 2\, R_{D'}(G_Q),$$

as wanted.
Thus, PAC-Bayesian bounds on the risk of the majority vote are usually bounds on
the Gibbs risk, multiplied by a factor of two. Even if this type of bound can be tight in
some situations, the factor two can also be misleading. Langford and Shawe-Taylor (2002)
have shown that under some circumstances, the factor of two can be reduced to $(1 + \epsilon)$.
Nevertheless, distributions $Q$ on voters giving $R_{D'}(G_Q) \gg R_{D'}(B_Q)$ are common. The
extreme case happens when the expected linear loss on each example is just below one half
- i.e., for all $(x, y)$, $\mathbf{E}_{f \sim Q}\, \mathcal{L}_\ell(f(x), y) = \frac{1}{2} - \epsilon$ -, leading to a perfect majority vote classifier but
an almost inaccurate Gibbs classifier. Indeed, we have $R_{D'}(G_Q) = \frac{1}{2} - \epsilon$ and $R_{D'}(B_Q) = 0$.
Therefore, in this circumstance, the bound $R_{D'}(B_Q) \le 1 - 2\epsilon$, given by Remark 6, fails to
represent the perfect accuracy of the majority vote. This problem is due to the fact that the
Gibbs risk only considers the loss of the average output of the population of voters. Hence,
the bound of Remark 6 states that the majority vote is weak whenever every individual voter
is weak. The bound cannot capture the fact that it might happen that the “community” of
voters compensates for individual errors. To overcome this lacuna, we need a bound that
compares the outputs of the voters with one another, not only the average quality of each voter
taken individually.
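The extreme case described above is easy to reproduce on synthetic data. In the sketch below, the 101 binary voters and their error pattern are a hypothetical construction: on every example exactly 50 voters err and 51 are right, so each voter is barely better than chance while the uniform majority vote never errs.

```python
def gibbs_and_bayes_risk(outputs, labels, weights):
    """Return (Gibbs risk, Bayes risk) on a sample, for voter outputs in [-1, 1]."""
    m = len(labels)
    margins = [
        y * sum(q * voter_outputs[j] for q, voter_outputs in zip(weights, outputs))
        for j, y in enumerate(labels)
    ]
    # By Definition 5, the Gibbs risk is (1 - mean of y * E_{f~Q} f(x)) / 2.
    gibbs = 0.5 * (1.0 - sum(margins) / m)
    # By Definition 4, the majority vote errs when y * E_{f~Q} f(x) <= 0.
    bayes = sum(1 for value in margins if value <= 0) / m
    return gibbs, bayes


# Hypothetical construction: voter i is wrong on example j iff (i - j) mod n < 50.
n = 101
labels = [1] * n
outputs = [[-1 if (i - j) % n < 50 else 1 for j in range(n)] for i in range(n)]
weights = [1.0 / n] * n

gibbs, bayes = gibbs_and_bayes_risk(outputs, labels, weights)
print(gibbs, bayes, 2 * gibbs)
```

Here the factor-two bound is close to 1 and therefore says almost nothing, although the majority vote is perfect on this sample.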

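We can compare the output of binary voters by considering the probability of disagreement
between them:

\begin{align*}
\Pr_{x \sim D'_{\mathcal{X}},\ h_1, h_2 \sim Q}\left( h_1(x) \neq h_2(x) \right)
&= \mathbf{E}_{x \sim D'_{\mathcal{X}}}\, \mathbf{E}_{h_1 \sim Q}\, \mathbf{E}_{h_2 \sim Q}\, I\left( h_1(x) \neq h_2(x) \right) \\
&= \mathbf{E}_{x \sim D'_{\mathcal{X}}}\, \mathbf{E}_{h_1 \sim Q}\, \mathbf{E}_{h_2 \sim Q}\, I\left( h_1(x) \cdot h_2(x) \neq 1 \right) \\
&= \mathbf{E}_{x \sim D'_{\mathcal{X}}}\, \mathbf{E}_{h_1 \sim Q}\, \mathbf{E}_{h_2 \sim Q}\, \mathcal{L}_{01}\left( h_1(x) \cdot h_2(x),\, 1 \right),
\end{align*}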
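where $D'_{\mathcal{X}}$ denotes the marginal on $\mathcal{X}$ of distribution $D'$. Definition 7 extends this notion
of disagreement to real-valued voters.

Definition 7 For any probability distribution $Q$ on a set of voters, the expected disagreement
$d_Q^{D'}$ relative to $D'$ is defined as

\begin{align*}
d_Q^{D'} &\overset{\text{def}}{=} \mathbf{E}_{x \sim D'_{\mathcal{X}}}\, \mathbf{E}_{f_1 \sim Q}\, \mathbf{E}_{f_2 \sim Q}\, \mathcal{L}_\ell\left( f_1(x) \cdot f_2(x),\, 1 \right) \\
&= \frac{1}{2}\left( 1 - \mathbf{E}_{x \sim D'_{\mathcal{X}}}\, \mathbf{E}_{f_1 \sim Q}\, \mathbf{E}_{f_2 \sim Q}\, f_1(x) \cdot f_2(x) \right) \\
&= \frac{1}{2}\left( 1 - \mathbf{E}_{x \sim D'_{\mathcal{X}}} \left[ \mathbf{E}_{f \sim Q} f(x) \right]^2 \right).
\end{align*}

Notice that the value of $d_Q^{D'}$ does not depend on the labels $y$ of the examples $(x, y) \sim D'$.
Therefore, we can estimate the expected disagreement with unlabeled data.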
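Since no label is involved, the expected disagreement of Definition 7 can be estimated from voter outputs alone. The sketch below computes it in two ways, from pairs of voters (first expression of the definition) and from the squared weighted vote (last expression), and checks that both agree on a small hypothetical sample.

```python
def expected_disagreement(outputs, weights):
    """Last expression of Definition 7: (1 - mean_x (E_{f~Q} f(x))^2) / 2.

    outputs[i][j] -- output in [-1, 1] of voter i on (unlabeled) example j
    weights[i]    -- Q-weight of voter i (nonnegative, summing to 1)
    """
    m = len(outputs[0])
    mean_squared_vote = sum(
        sum(q * voter_outputs[j] for q, voter_outputs in zip(weights, outputs)) ** 2
        for j in range(m)
    ) / m
    return 0.5 * (1.0 - mean_squared_vote)


def pairwise_disagreement(outputs, weights):
    """First expression of Definition 7: linear loss of f1(x) * f2(x) against 1."""
    m = len(outputs[0])
    total = 0.0
    for q1, out1 in zip(weights, outputs):
        for q2, out2 in zip(weights, outputs):
            mean_loss = sum(0.5 * (1 - a * b) for a, b in zip(out1, out2)) / m
            total += q1 * q2 * mean_loss
    return total


# Two hypothetical binary voters that disagree on half of four unlabeled examples.
outputs = [[1, 1, -1, -1], [1, -1, -1, 1]]
weights = [0.5, 0.5]

print(expected_disagreement(outputs, weights))
print(pairwise_disagreement(outputs, weights))
```

The pairwise version costs a number of operations quadratic in the number of voters, whereas the last expression needs only one pass over the voters per example.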


4. Bounds on the Risk of the Majority Vote 

The aim of this section is to introduce the C-bound, which upper-bounds the risk of the
majority vote (Definition 4) based on the Gibbs risk (Definition 5) and the expected
disagreement (Definition 7). We start by studying the margin of a majority vote as a random
variable (Section 4.1). From the first moment of the margin, we easily recover the well-known
bound of twice the Gibbs risk presented by Remark 6 (Section 4.2). We therefore
suggest extending this analysis to the second moment of the margin to obtain the C-bound
(Section 4.3). Finally, we present some statistical properties of the C-bound (Section 4.4)
and an empirical study of its predictive power (Section 4.5).

4.1 The Margin of the Majority Vote and its Moments 

The bounds on the risk of a majority vote classifier proposed in this section result from the
study of the weighted margin of the majority vote as a random variable.

Definition 8 Let $M_Q^{D'}$ be the random variable that, given any example $(x, y)$ drawn according
to $D'$, outputs the margin of the majority vote $B_Q$ on that example, which is

$$M_Q(x, y) \overset{\text{def}}{=} \mathbf{E}_{f \sim Q}\, y \cdot f(x).$$

From Definitions 4 and 8, we have the following nice property:⁴

$$R_{D'}(B_Q) = \Pr_{(x,y) \sim D'}\left( M_Q(x, y) \le 0 \right). \tag{5}$$

4. Note that for another choice of the zero-one loss definition (Definition 1), the tie in the majority vote -
i.e., when $M_Q(x, y) = 0$ - would have been more complicated to handle, and the statement should have
been relaxed to

$$\Pr_{(x,y) \sim D'}\left( M_Q(x, y) < 0 \right) \le R_{D'}(B_Q) \le \Pr_{(x,y) \sim D'}\left( M_Q(x, y) \le 0 \right).$$
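The margin is not only related to the risk of the majority vote, but also to the Gibbs risk.
For that purpose, let us consider the first moment $\mu_1(M_Q^{D'})$ of the random variable $M_Q^{D'}$,
which is defined as

$$\mu_1(M_Q^{D'}) \overset{\text{def}}{=} \mathbf{E}_{(x,y) \sim D'}\, M_Q(x, y). \tag{6}$$

We can now rewrite the Gibbs risk (Definition 5) as a function of $\mu_1(M_Q^{D'})$, since

\begin{align}
R_{D'}(G_Q) &= \mathbf{E}_{f \sim Q}\, \mathbf{E}^{\mathcal{L}_\ell}_{D'}(f) \nonumber \\
&= \frac{1}{2}\left( 1 - \mathbf{E}_{(x,y) \sim D'}\, \mathbf{E}_{f \sim Q}\, y \cdot f(x) \right) \nonumber \\
&= \frac{1}{2}\left( 1 - \mathbf{E}_{(x,y) \sim D'}\, M_Q(x, y) \right) \nonumber \\
&= \frac{1}{2}\left( 1 - \mu_1(M_Q^{D'}) \right). \tag{7}
\end{align}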


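Similarly, we can rewrite the expected disagreement as a function of the second moment
of the margin. We use $\mu_2(M_Q^{D'})$ to denote the second moment. Since $y \in \{-1, 1\}$ and,
therefore, $y^2 = 1$, the second moment of the margin does not rely on labels. Indeed, we
have

\begin{align}
\mu_2(M_Q^{D'}) &= \mathbf{E}_{(x,y) \sim D'} \left[ M_Q(x, y) \right]^2 \nonumber \\
&= \mathbf{E}_{(x,y) \sim D'}\, y^2 \cdot \left[ \mathbf{E}_{f \sim Q} f(x) \right]^2 \nonumber \\
&= \mathbf{E}_{x \sim D'_{\mathcal{X}}} \left[ \mathbf{E}_{f \sim Q} f(x) \right]^2. \tag{8}
\end{align}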


Hence, from the last equality and Definition 7, the expected disagreement can be expressed
as

$$d_Q^{D'} = \frac{1}{2}\left( 1 - \mu_2(M_Q^{D'}) \right). \tag{9}$$


Equation (9) shows that $0 \le d_Q^{D'} \le 1/2$, since $0 \le \mu_2(M_Q^{D'}) \le 1$. Furthermore, we
can upper-bound the disagreement more tightly than simply saying it is at most 1/2 by
making use of the value of the Gibbs risk. To do so, let us write the variance of the margin
as

$$\mathrm{Var}(M_Q^{D'}) = \mathrm{Var}_{(x,y) \sim D'}\left( M_Q(x, y) \right) = \mu_2(M_Q^{D'}) - \left( \mu_1(M_Q^{D'}) \right)^2. \tag{10}$$

Therefore, as the variance cannot be negative, it follows that

$$\mu_2(M_Q^{D'}) \ge \left( \mu_1(M_Q^{D'}) \right)^2,$$
which implies that

$$1 - 2 \cdot d_Q^{D'} \ge \left( 1 - 2 \cdot R_{D'}(G_Q) \right)^2. \tag{11}$$

Easy calculation then gives the desired bound of $d_Q^{D'}$ (that is based on the Gibbs risk):

$$d_Q^{D'} \le 2 \cdot R_{D'}(G_Q) \cdot \left( 1 - R_{D'}(G_Q) \right). \tag{12}$$

We therefore have the following proposition.

Proposition 9 For any distribution $Q$ on a set of voters and any distribution $D'$ on
$\mathcal{X} \times \{-1, 1\}$, we have

$$d_Q^{D'} \le 2 \cdot R_{D'}(G_Q) \cdot \left( 1 - R_{D'}(G_Q) \right) \le \frac{1}{2}.$$

Moreover, if $d_Q^{D'} = \frac{1}{2}$ then $R_{D'}(G_Q) = \frac{1}{2}$.

Proof Equation (12) gives the first inequality. The rest of the proposition directly follows
from the fact that $f(x) = 2x(1 - x)$ is a parabola whose (unique) maximum is at the
point $\left( \frac{1}{2}, \frac{1}{2} \right)$. ■
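The "easy calculation" leading from Equation (11) to Equation (12) can be spelled out as follows; this is only an expanded version of the step, using the notation $R_{D'}(G_Q)$ for the Gibbs risk and $d_Q^{D'}$ for the expected disagreement.

```latex
\begin{align*}
1 - 2\, d_Q^{D'} &\ge \left(1 - 2\, R_{D'}(G_Q)\right)^2
                  = 1 - 4\, R_{D'}(G_Q) + 4\, R_{D'}(G_Q)^2 \\
\Longleftrightarrow\quad 2\, d_Q^{D'} &\le 4\, R_{D'}(G_Q) - 4\, R_{D'}(G_Q)^2 \\
\Longleftrightarrow\quad d_Q^{D'} &\le 2\, R_{D'}(G_Q)\left(1 - R_{D'}(G_Q)\right).
\end{align*}
```

For instance, a Gibbs risk of $0.1$ forces the expected disagreement to be at most $2 \cdot 0.1 \cdot 0.9 = 0.18$: accurate voters cannot disagree much.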

4.2 Rediscovering the bound $R_{D'}(B_Q) \le 2 \cdot R_{D'}(G_Q)$

The well-known factor of two with which one can transform a bound on the Gibbs risk
$R_{D'}(G_Q)$ into a bound on the risk $R_{D'}(B_Q)$ of the majority vote is usually justified by
an argument similar to the one given in Remark 6. However, as shown by the proof of
Proposition 10, the result can also be obtained by considering that the risk of the majority
vote is the probability that the margin $M_Q^{D'}$ is less than or equal to zero (Equation 5) and
by simply applying Markov's inequality (Lemma 46, provided in Appendix A).

Proposition 10 For any distribution $Q$ on a set of voters and any distribution $D'$ on
$\mathcal{X} \times \{-1, 1\}$, we have

$$R_{D'}(B_Q) \le 2 \cdot R_{D'}(G_Q).$$

Proof Starting from Equation (5) and using Markov's inequality (Lemma 46), we have

\begin{align*}
R_{D'}(B_Q) &= \Pr_{(x,y) \sim D'}\left( M_Q(x, y) \le 0 \right) \\
&= \Pr_{(x,y) \sim D'}\left( 1 - M_Q(x, y) \ge 1 \right) \\
&\le \mathbf{E}_{(x,y) \sim D'}\left( 1 - M_Q(x, y) \right) && \text{(Markov's inequality)} \\
&= 1 - \mathbf{E}_{(x,y) \sim D'}\, M_Q(x, y) \\
&= 1 - \mu_1(M_Q^{D'}) \\
&= 2 \cdot R_{D'}(G_Q).
\end{align*}

The last equality is directly obtained from Equation (7). ■
Figure 2: Contour plots of the C-bound. (a) First form. (b) Second form. (c) Third form.


This proof highlights that we can upper-bound $R_{D'}(B_Q)$ by considering solely the first
moment of the margin $\mu_1(M_Q^{D'})$. Once we realize this fact, it becomes natural to extend
this result to higher moments. We do so in the following subsection where we make use of
Chebyshev's inequality (instead of Markov's inequality), which uses not only the first, but
also the second moment of the margin. This gives rise to the C-bound of Theorem 11.


4.3 The C-bound: a Bound on $R_{D'}(B_Q)$ That Can Be Much Smaller Than $R_{D'}(G_Q)$

Here is the bound on which most of the results of this paper are based. We refer to it as the
C-bound. It was first introduced (but in a different form) in Lacasse et al. (2006).⁵ We give
here three different (but equivalent) forms of the C-bound. Each one highlights a different
property or behavior of the bound. Figure 2 illustrates these behaviors.

It is interesting to note that the proof of Theorem 11 below has the same starting point as
the proof of Proposition 10, but uses Chebyshev's inequality instead of Markov's inequality
(respectively Lemmas 48 and 46, both provided in Appendix A). Therefore, Theorem 11 is
based on the variance of the margin in addition to its mean.


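Theorem 11 (The C-bound) For any distribution $Q$ on a set of voters and any distribution
$D'$ on $\mathcal{X} \times \{-1, 1\}$, if $\mu_1(M_Q^{D'}) > 0$ (i.e., $R_{D'}(G_Q) < 1/2$), we have

$$R_{D'}(B_Q) \le \mathcal{C}_Q^{D'},$$

where

$$\mathcal{C}_Q^{D'} \overset{\text{def}}{=} \underbrace{\frac{\mathrm{Var}(M_Q^{D'})}{\mu_2(M_Q^{D'})}}_{\text{First form}} = \underbrace{1 - \frac{\left( \mu_1(M_Q^{D'}) \right)^2}{\mu_2(M_Q^{D'})}}_{\text{Second form}} = \underbrace{1 - \frac{\left( 1 - 2 \cdot R_{D'}(G_Q) \right)^2}{1 - 2 \cdot d_Q^{D'}}}_{\text{Third form}}.$$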

5. We present the form used by Lacasse et al. (2006) in Remark 12 at the end of the present subsection. 
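Proof Starting from Equation (5) and using the one-sided Chebyshev inequality (Lemma 48),
with $X = -M_Q(x, y)$, $\mu = \mathbf{E}_{(x,y) \sim D'}\left( -M_Q(x, y) \right)$ and $a = \mathbf{E}_{(x,y) \sim D'}\, M_Q(x, y)$, we obtain

\begin{align}
R_{D'}(B_Q) &= \Pr_{(x,y) \sim D'}\left( M_Q(x, y) \le 0 \right) \nonumber \\
&= \Pr_{(x,y) \sim D'}\left( -M_Q(x, y) + \mathbf{E}_{(x,y) \sim D'}\, M_Q(x, y) \ge \mathbf{E}_{(x,y) \sim D'}\, M_Q(x, y) \right) \nonumber \\
&\le \frac{\mathrm{Var}_{(x,y) \sim D'}\left( M_Q(x, y) \right)}{\mathrm{Var}_{(x,y) \sim D'}\left( M_Q(x, y) \right) + \left( \mathbf{E}_{(x,y) \sim D'}\, M_Q(x, y) \right)^2} \qquad \text{(Chebyshev's inequality)} \nonumber \\
&= \frac{\mathrm{Var}(M_Q^{D'})}{\mu_2(M_Q^{D'}) - \left( \mu_1(M_Q^{D'}) \right)^2 + \left( \mu_1(M_Q^{D'}) \right)^2} = \frac{\mathrm{Var}(M_Q^{D'})}{\mu_2(M_Q^{D'})} \tag{13} \\
&= \frac{\mu_2(M_Q^{D'}) - \left( \mu_1(M_Q^{D'}) \right)^2}{\mu_2(M_Q^{D'})} = 1 - \frac{\left( \mu_1(M_Q^{D'}) \right)^2}{\mu_2(M_Q^{D'})} \tag{14} \\
&= 1 - \frac{\left( 1 - 2 \cdot R_{D'}(G_Q) \right)^2}{1 - 2 \cdot d_Q^{D'}}. \tag{15}
\end{align}

Lines (13) and (14) respectively present the first and the second forms of $\mathcal{C}_Q^{D'}$, and follow
from the definitions of $\mu_1(M_Q^{D'})$, $\mu_2(M_Q^{D'})$, and $\mathrm{Var}(M_Q^{D'})$ (see Equations 6, 8 and 10).
The third form of $\mathcal{C}_Q^{D'}$ is obtained at Line (15) using $\mu_1(M_Q^{D'}) = 1 - 2 \cdot R_{D'}(G_Q)$ and
$\mu_2(M_Q^{D'}) = 1 - 2 \cdot d_Q^{D'}$, which can be derived directly from Equations (7) and (9). ■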

The third form of the C-bound shows that the bound decreases when the Gibbs risk $R_{D'}(G_Q)$
decreases or when the disagreement $d_Q^{D'}$ increases. This new bound therefore suggests that
a majority vote should perform a trade-off between the Gibbs risk and the disagreement
in order to achieve a low Bayes risk. This is more informative than the usual bound of
Proposition 10, which focuses solely on the minimization of the Gibbs risk.

The first form of the C-bound highlights that its value is always nonnegative (since the
variance and the second moment of the margin are nonnegative), whereas the second form of the
C-bound highlights that it cannot exceed one. Finally, the fact that $d_Q^{D'} = \frac{1}{2} \Rightarrow R_{D'}(G_Q) = \frac{1}{2}$
(Proposition 9) implies that the bound is always defined, since $R_{D'}(G_Q)$ is here assumed to
be strictly less than $\frac{1}{2}$.
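To make the three forms of Theorem 11 concrete, the sketch below evaluates them on a sample and checks numerically that they coincide. The Gibbs risk and the disagreement are computed from their own definitions (not from the moments of the margin), so the agreement of the third form is a genuine check. The three voters, their weights and the five examples are hypothetical.

```python
def c_bound_forms(outputs, labels, weights):
    """Empirical C-bound in its three forms, plus related quantities.

    outputs[i][j] -- output in [-1, 1] of voter i on example j
    labels[j]     -- label in {-1, +1} of example j
    weights[i]    -- Q-weight of voter i (nonnegative, summing to 1)
    """
    m = len(labels)
    votes = [sum(q * out[j] for q, out in zip(weights, outputs)) for j in range(m)]
    margins = [y * v for y, v in zip(labels, votes)]
    mu1 = sum(margins) / m
    mu2 = sum(value ** 2 for value in margins) / m
    if mu1 <= 0:
        raise ValueError("Theorem 11 requires a positive first moment of the margin")
    variance = mu2 - mu1 ** 2
    # Gibbs risk from Definition 5 (Q-average of the voters' mean linear loss).
    gibbs = sum(
        q * sum(0.5 * (1 - y * o) for o, y in zip(out, labels)) / m
        for q, out in zip(weights, outputs)
    )
    # Expected disagreement from Definition 7 (label-free).
    disagreement = 0.5 * (1.0 - sum(v ** 2 for v in votes) / m)
    return {
        "first": variance / mu2,
        "second": 1.0 - mu1 ** 2 / mu2,
        "third": 1.0 - (1.0 - 2.0 * gibbs) ** 2 / (1.0 - 2.0 * disagreement),
        "gibbs": gibbs,
        "disagreement": disagreement,
        "bayes": sum(1 for value in margins if value <= 0) / m,
    }


# Hypothetical sample: three binary voters with non-uniform weights.
labels = [1, 1, -1, -1, 1]
outputs = [
    [1, 1, 1, -1, 1],     # wrong on example 2
    [1, -1, -1, -1, -1],  # wrong on examples 1 and 4
    [-1, 1, -1, 1, 1],    # wrong on examples 0 and 3
]
weights = [0.4, 0.35, 0.25]

result = c_bound_forms(outputs, labels, weights)
print(result)
```

On this sample, the C-bound is well below twice the Gibbs risk, which illustrates how taking the disagreement into account can tighten the bound.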


Remark 12 As explained before, the C-bound was originally stated in Lacasse et al. (2006),
but in a different form. It was presented as a function of $W_Q(x, y)$, the $Q$-weight of voters
making an error on example $(x, y)$. More precisely, the C-bound was presented as follows:

$$\mathcal{C}_Q^{D'} = \frac{\mathrm{Var}_{(x,y) \sim D'}\left( W_Q(x, y) \right)}{\mathrm{Var}_{(x,y) \sim D'}\left( W_Q(x, y) \right) + \left( 1/2 - R_{D'}(G_Q) \right)^2}.$$
It is easy to show that this form is equivalent to the three forms stated in Theorem 11, and
that $W_Q(x, y)$ and $M_Q(x, y)$ are related by

$$W_Q(x, y) = \mathbf{E}_{f \sim Q}\, \mathcal{L}_\ell(f(x), y) = \frac{1}{2}\left( 1 - y \cdot \mathbf{E}_{f \sim Q} f(x) \right) = \frac{1}{2}\left( 1 - M_Q(x, y) \right).$$

However, we do not discuss further this form of the C-bound here, since we now consider
that the margin $M_Q(x, y)$ is a more natural notion than $W_Q(x, y)$.
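The equivalence claimed in Remark 12 can be verified in a few lines. The derivation below is a sketch that only uses the affine relation between the $Q$-weight of erring voters and the margin, together with Equations (7) and (10); here $W$ and $M$ abbreviate $W_Q(x,y)$ and $M_Q(x,y)$, and $\mu_1, \mu_2$ are the first two moments of the margin.

```latex
\begin{align*}
W = \tfrac{1}{2}\,(1 - M)
  \;&\Longrightarrow\;
  \mathrm{Var}(W) = \tfrac{1}{4}\,\mathrm{Var}(M)
  \quad\text{and}\quad
  \tfrac{1}{2} - R_{D'}(G_Q) = \tfrac{1}{2}\,\mu_1 , \\[4pt]
\frac{\mathrm{Var}(W)}{\mathrm{Var}(W) + \left(\tfrac{1}{2} - R_{D'}(G_Q)\right)^2}
  &= \frac{\tfrac{1}{4}\,\mathrm{Var}(M)}{\tfrac{1}{4}\,\mathrm{Var}(M) + \tfrac{1}{4}\,\mu_1^2}
   = \frac{\mathrm{Var}(M)}{\mathrm{Var}(M) + \mu_1^2}
   = \frac{\mathrm{Var}(M)}{\mu_2} ,
\end{align*}
```

which is the first form of the C-bound in Theorem 11.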


4.4 Statistical Analysis of the C-bound’s Behavior 

This section presents some properties of the C-bound. In the first place, we discuss the
conditions under which the C-bound is optimal, in the sense that if the only information
that one has about a majority vote is the first two moments of its margin distribution, it
is possible that the value given by the C-bound is the Bayes risk, i.e., $\mathcal{C}_Q^{D'} = R_{D'}(B_Q)$.⁶
In the second place, we show that the C-bound can be arbitrarily small, especially in the
presence of “non-correlated” voters, even if the Gibbs risk is large, i.e., $\mathcal{C}_Q^{D'} \ll R_{D'}(G_Q)$.


4.4.1 Conditions of Optimality 

For the sake of simplicity, let us focus on a random variable $M$ that represents a margin
distribution (here, we ignore underlying distributions $Q$ on $\mathcal{H}$ and $D'$ on $\mathcal{X} \times \{-1, 1\}$) of
first moment $\mu_1(M)$ and second moment $\mu_2(M)$. By Equation (5), we have

$$R(B_M) \overset{\text{def}}{=} \Pr\left( M \le 0 \right). \tag{16}$$

Moreover, $R(B_M)$ is upper-bounded by $\mathcal{C}_M$, the C-bound given by the second form of
Theorem 11,

$$\mathcal{C}_M \overset{\text{def}}{=} 1 - \frac{\left( \mu_1(M) \right)^2}{\mu_2(M)}. \tag{17}$$

The next proposition shows when the C-bound can be achieved.


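Proposition 13 (Optimality of the C-bound) Let $M$ be any random variable that represents
the margin of a majority vote. Then there exists a random variable $\widetilde{M}$ such that

$$\mu_1(\widetilde{M}) = \mu_1(M), \quad \mu_2(\widetilde{M}) = \mu_2(M), \quad \text{and} \quad \mathcal{C}_{\widetilde{M}} = \mathcal{C}_M = R(B_{\widetilde{M}}) \tag{18}$$

if and only if

$$0 < \mu_2(M) \le \mu_1(M). \tag{19}$$

Proof First, let us show that (19) implies (18). Given $0 < \mu_2(M) \le \mu_1(M)$, we consider
a distribution $\widetilde{M}$ concentrated in two points defined as

$$\widetilde{M} = \begin{cases} 0 & \text{with probability } \mathcal{C}_M = 1 - \dfrac{\left( \mu_1(M) \right)^2}{\mu_2(M)}, \\[8pt] \dfrac{\mu_2(M)}{\mu_1(M)} & \text{with probability } 1 - \mathcal{C}_M = \dfrac{\left( \mu_1(M) \right)^2}{\mu_2(M)}. \end{cases}$$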

6. In other words, the optimality of the C-bound means here that there exists a random variable with the 
same first moments as the margin distribution, such that Chebyshev’s inequality of Lemma 48 is reached. 
This distribution has the required moments, as

\[ \mu_1(\widetilde{M}) = \frac{(\mu_1(M))^2}{\mu_2(M)} \cdot \frac{\mu_2(M)}{\mu_1(M)} = \mu_1(M), \quad \text{and} \quad \mu_2(\widetilde{M}) = \frac{(\mu_1(M))^2}{\mu_2(M)} \cdot \left[\frac{\mu_2(M)}{\mu_1(M)}\right]^2 = \mu_2(M). \]

It follows directly from Equation (17) that $\mathcal{C}_{\widetilde{M}} = \mathcal{C}_M$. Moreover, by Equation (16) and
because $\frac{\mu_2(M)}{\mu_1(M)} > 0$, we obtain as desired

\[ R(B_{\widetilde{M}}) = \Pr(\widetilde{M} \le 0) = \mathcal{C}_M. \]

Now, let us show that (18) implies (19). Consider a distribution $\widetilde{M}$ such that the
equalities of Line (18) are satisfied. By Proposition 10 and Equation (7), we obtain the
inequality

\[ \mathcal{C}_M = R(B_{\widetilde{M}}) \le 1 - \mu_1(\widetilde{M}) = 1 - \mu_1(M). \]

Hence, by the definition of $\mathcal{C}_M$, we have

\[ 1 - \frac{(\mu_1(M))^2}{\mu_2(M)} \le 1 - \mu_1(M), \]

which, by straightforward calculations, implies $0 < \mu_2(M) \le \mu_1(M)$, and we are done. ■
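The construction used in the first half of the proof can be checked numerically. The following is a minimal sketch, not part of the original analysis: the moment values and the helper names `c_bound` and `two_point_margin` are hypothetical, chosen only to satisfy condition (19). Exact rational arithmetic is used so that the equalities can be tested without rounding.

```python
from fractions import Fraction

def c_bound(mu1, mu2):
    """Second form of the C-bound: 1 - mu1^2 / mu2 (needs mu1 > 0 and mu2 > 0)."""
    return 1 - mu1 ** 2 / mu2

def two_point_margin(mu1, mu2):
    """Two-point distribution from the proof, as (value, probability) pairs."""
    if not 0 < mu2 <= mu1:
        raise ValueError("condition (19) requires 0 < mu2 <= mu1")
    c = c_bound(mu1, mu2)
    return [(Fraction(0), c), (mu2 / mu1, 1 - c)]

# Hypothetical moments chosen so that 0 < mu2 <= mu1.
mu1, mu2 = Fraction(1, 2), Fraction(2, 5)
dist = two_point_margin(mu1, mu2)

first_moment = sum(v * p for v, p in dist)
second_moment = sum(v ** 2 * p for v, p in dist)
bayes_risk = sum(p for v, p in dist if v <= 0)  # Pr(M~ <= 0)
```

With these values, the two-point distribution reproduces both moments and its Bayes risk coincides with the C-bound, so the Chebyshev-type inequality is attained.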


We discussed in Section 4.1 the multiple connections between the moments of the margin, 
the Gibbs risk and the expected disagreement of a majority vote. In the next proposition, 
we exploit these connections to derive expressions equivalent to Line (19) of Proposition 13. 
Thus, this shows three (equivalent) necessary conditions under which the C-bound is optimal.


Proposition 14 For any distribution $Q$ on a set of voters and any distribution $D'$ on
$\mathcal{X}\times\{-1,1\}$, if $\mu_1(M_Q^{D'}) > 0$ (i.e., $R_{D'}(G_Q) < 1/2$), then the three following statements are
equivalent:

(i) $\mu_2(M_Q^{D'}) \le \mu_1(M_Q^{D'})$ ;

(ii) $R_{D'}(G_Q) \le d_Q^{D'}$ ;

(iii) $\mathcal{C}_Q^{D'} \le 2\,R_{D'}(G_Q)$ .


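Proof The truth of (i) $\Leftrightarrow$ (ii) is a direct consequence of Equations (7) and (9). To prove
(ii) $\Leftrightarrow$ (iii), we express $\mathcal{C}_Q^{D'}$ in its third form. Straightforward calculations give

\[ \mathcal{C}_Q^{D'} = 1 - \frac{\left(1 - 2\,R_{D'}(G_Q)\right)^2}{1 - 2\,d_Q^{D'}} \le 2\,R_{D'}(G_Q) \quad \Longleftrightarrow \quad R_{D'}(G_Q) \le d_Q^{D'}. \]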


Propositions 13 and 14 illustrate an interesting result: the C-bound is optimal if and only if 
its value is lower than twice the Gibbs risk, the classical bound on the risk of the majority 
vote (see Proposition 10). 
4.4.2 The C-bound Can Be Arbitrarily Small, Even for Large Gibbs Risks

The next result shows that, when the number of voters tends to infinity (and the weight of
each voter tends to zero), the variance of $M_Q^{D'}$ will tend to 0 provided that the average of the
covariance of the outputs of all pairs of distinct voters is $\le 0$. In particular, the variance
will always tend to 0 if the risk of the voters is pairwise independent. To quantify the
independence between voters, we use the concept of covariance of a pair of voters $(f_1, f_2)$:

\begin{align*}
\mathrm{Cov}^{D'}(f_1, f_2) &\overset{\text{def}}{=} \mathop{\mathrm{Cov}}_{(x,y)\sim D'}\big(y \cdot f_1(x),\; y \cdot f_2(x)\big) \\
&= \mathop{\mathbb{E}}_{(x,y)\sim D'}\big(y f_1(x) \cdot y f_2(x)\big) - \Big(\mathop{\mathbb{E}}_{(x,y)\sim D'} y f_1(x)\Big)\Big(\mathop{\mathbb{E}}_{(x,y)\sim D'} y f_2(x)\Big).
\end{align*}

Note that the covariance $\mathrm{Cov}^{D'}(f_1, f_2)$ is zero when $f_1$ and $f_2$ are independent (uncorrelated).


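Proposition 15 For any countable set of voters $\mathcal{H}$, any distribution $Q$ on $\mathcal{H}$, and any
distribution $D'$ on $\mathcal{X}\times\{-1,1\}$, we have

\[ \mathrm{Var}(M_Q^{D'}) \le \sum_{f\in\mathcal{H}} Q^2(f) + \sum_{f_1\in\mathcal{H}} \sum_{\substack{f_2\in\mathcal{H}:\\ f_2\ne f_1}} Q(f_1)\,Q(f_2)\cdot \mathrm{Cov}^{D'}(f_1, f_2). \]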

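Proof By the definition of the margin (Definition 8), we rewrite $M_Q(x,y)$ as a sum of
random variables:

\begin{align*}
\mathop{\mathrm{Var}}_{(x,y)\sim D'}\big(M_Q(x,y)\big) &= \mathop{\mathrm{Var}}_{(x,y)\sim D'}\Big(\sum_{f\in\mathcal{H}} Q(f)\cdot y\cdot f(x)\Big) \\
&= \sum_{f\in\mathcal{H}} Q^2(f)\mathop{\mathrm{Var}}_{(x,y)\sim D'}\big(y\cdot f(x)\big) + \sum_{f_1\in\mathcal{H}} \sum_{\substack{f_2\in\mathcal{H}:\\ f_2\ne f_1}} Q(f_1)\,Q(f_2)\mathop{\mathrm{Cov}}_{(x,y)\sim D'}\big(y\cdot f_1(x),\; y\cdot f_2(x)\big).
\end{align*}

The inequality is a consequence of the fact that $\forall f\in\mathcal{H}$: $\mathrm{Var}_{(x,y)\sim D'}\big(y\cdot f(x)\big) \le 1$. ■

The key observation that comes out of this result is that $\sum_{f\in\mathcal{H}} Q^2(f)$ is usually much
smaller than one. Consider, for example, the case where $Q$ is uniform on $\mathcal{H}$ with $|\mathcal{H}| = n$.
Then $\sum_{f\in\mathcal{H}} Q^2(f) = 1/n$. Moreover, if $\mathrm{Cov}^{D'}(f_1, f_2) \le 0$ for each pair of distinct classifiers
in $\mathcal{H}$, then $\mathrm{Var}(M_Q^{D'}) \le 1/n$. Hence, in these cases, we have that $\mathcal{C}_Q^{D'} \in O(1/n)$ whenever
$1 - 2\,R_{D'}(G_Q)$ and $1 - 2\,d_Q^{D'}$ are larger than some positive constants independent of $n$. Thus,
even when $R_{D'}(G_Q)$ is large, we see that the C-bound can be arbitrarily close to 0 as we
increase the number of classifiers having non-positive pairwise covariance of their risk. More
precisely, we have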

Corollary 16 Given $n$ independent voters under a uniform distribution $Q$, we have

\[ R_{D'}(B_Q) \le \mathcal{C}_Q^{D'} \le \frac{1}{n\cdot\left(1 - 2\,d_Q^{D'}\right)} \le \frac{1}{n\cdot\left(1 - 2\,R_{D'}(G_Q)\right)^2}. \]
Proof The first inequality directly comes from the C-bound (Theorem 11). The second
inequality is a consequence of Proposition 15, considering that in the case of a uniform
distribution of independent voters, we have $\mathrm{Cov}^{D'}(f_1, f_2) = 0$, and then $\mathrm{Var}(M_Q^{D'}) \le 1/n$.
Applying this to the first form of the C-bound, we obtain

\[ \mathcal{C}_Q^{D'} = \frac{\mathrm{Var}(M_Q^{D'})}{\mu_2(M_Q^{D'})} = \frac{\mathrm{Var}(M_Q^{D'})}{1 - 2\,d_Q^{D'}} \le \frac{1}{n\cdot\left(1 - 2\,d_Q^{D'}\right)}. \]

To obtain the third inequality, we simply apply Equation (11), which yields the lower bound
$1 - 2\,d_Q^{D'} \ge \left(1 - 2\,R_{D'}(G_Q)\right)^2$ on the denominator,
and we are done. ■
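The chain of inequalities of Corollary 16 can be illustrated with closed-form margin statistics. The sketch below rests on an assumption that is not stated in the text: all $n$ binary voters share the same risk `r`, and the random variables $y \cdot f_i(x)$ are mutually independent. The function name `independent_vote` is hypothetical.

```python
def independent_vote(n, r):
    """Exact margin statistics for n independent binary voters of risk r, uniform Q."""
    mu1 = 1 - 2 * r                  # first moment of the margin
    var = 4 * r * (1 - r) / n        # variance of the averaged margin (zero covariances)
    mu2 = var + mu1 ** 2             # second moment of the margin
    d = (1 - mu2) / 2                # expected disagreement, from mu2 = 1 - 2d
    c = var / mu2                    # first form of the C-bound
    return c, d

r = 0.4  # hypothetical individual (and therefore Gibbs) risk
rows = []
for n in (10, 100, 1000):
    c, d = independent_vote(n, r)
    bound_d = 1 / (n * (1 - 2 * d))        # middle term of Corollary 16
    bound_r = 1 / (n * (1 - 2 * r) ** 2)   # right-hand term of Corollary 16
    rows.append((n, c, bound_d, bound_r))
    print(n, round(c, 4), round(bound_d, 4), round(bound_r, 4))
```

Even with a Gibbs risk of 0.4, the C-bound computed this way shrinks roughly like $1/n$ as voters are added.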


4.5 Empirical Study of The Predictive Power of the C-bound 

To further motivate the use of the C-bound, we investigate how its empirical value relates 
to the risk of the majority vote by conducting two experiments. The first experiment shows 
that the C-bound clearly outperforms the individual capacity of the other quantities of 
Theorem 11 in the task of predicting the risk of the majority vote. The second experiment 
shows that the C-bound is a great stopping criterion for Boosting algorithms. 

4.5.1 Comparison with Other Indicators 

We study how $R_{D'}(G_Q)$, $\mathrm{Var}(M_Q^{D'})$, $d_Q^{D'}$ and $\mathcal{C}_Q^{D'}$ are respectively related to $R_{D'}(B_Q)$. Note
that these four quantities appear in the first form or the third form of the C-bound
(Theorem 11). We omit here the moments $\mu_1(M_Q^{D'})$ and $\mu_2(M_Q^{D'})$ required by the second form of
the C-bound, as there is a linear relation between $\mu_1(M_Q^{D'})$ and $R_{D'}(G_Q)$, as well as between
$\mu_2(M_Q^{D'})$ and $d_Q^{D'}$.

The results of Figure 3 are obtained with the AdaBoost algorithm of Schapire and Singer
(1999), used with "decision stumps" as weak learners, on several UCI binary classification
data sets (Blake and Merz, 1998). Each data set is split into two halves: a training set $S$
and a testing set $T$. We run AdaBoost on set $S$ for 100 rounds and compute the quantities
$R_T(G_Q)$, $\mathrm{Var}(M_Q^T)$, $d_Q^T$ and $\mathcal{C}_Q^T$ on set $T$ at every 5 rounds of boosting. That is, we study
20 different majority vote classifiers per data set.

In Figure 3a, we see that we almost always have $R_T(B_Q) < R_T(G_Q)$. There is, however,
no clear correlation between $R_T(B_Q)$ and $R_T(G_Q)$. We also see no clear correlation between
$R_T(B_Q)$ and $\mathrm{Var}(M_Q^T)$ or between $R_T(B_Q)$ and $d_Q^T$ in Figures 3b and 3c respectively, except
that generally $R_T(B_Q) > \mathrm{Var}(M_Q^T)$ and $R_T(B_Q) < d_Q^T$. In contrast, Figure 3d shows a
strong correlation between $\mathcal{C}_Q^T$ and $R_T(B_Q)$. Indeed, it is almost a linear relation! Therefore,
the C-bound seems well-suited to characterize the behavior of the Bayes risk, whereas each
of the individual quantities contained in the C-bound is insufficient to do so.

4.5.2 The C-bound as a Stopping Criterion for Boosting 

We now evaluate the accuracy of the empirical value of the C-bound as a model selection 
tool. More specifically, we compare its ability to act as a stopping criterion for the AdaBoost 
algorithm. 
Figure 3: $R_T(B_Q)$ versus $R_T(G_Q)$, $\mathrm{Var}(M_Q^T)$, $d_Q^T$ and $\mathcal{C}_Q^T$ respectively. Recovered panel
titles: (b) Variance of the margin; (c) Expected disagreement; (d) C-bound.


We use the same version of the algorithm and the same data sets as in the previous
experiment. However, for this experiment, each data set is split into a training set $S$ of
at most 400 examples and a testing set $T$ containing the remaining examples. We run
AdaBoost on set $S$ for 1000 rounds. At each round, we compute the empirical C-bound
$\mathcal{C}_Q^S$ (on the training set). Afterwards, we select the majority vote classifier with the lowest
value of $\mathcal{C}_Q^S$ and compute its Bayes risk $R_T(B_Q)$ (on the test set). We compare this stopping
criterion with three other methods. For the first method, we compute the empirical Bayes
risk $R_S(B_Q)$ at each round of boosting and, after that, we select the one having the lowest
such risk.$^7$ The second method consists in performing 5-fold cross-validation and selecting
the number of boosting rounds having the lowest cross-validation risk. Finally, the third
method is to reserve 10% of $S$ as a validation set, train AdaBoost on the remaining 90%,

7. When several iterations have the same value of $R_S(B_Q)$, we select the earlier one.
Data Set Information             Risk R_T(B_Q) by Stopping Criterion (and number of rounds performed)

Name            |S|     |T|      C-bound C_Q^S   Risk R_S(B_Q)   Validation Set   Cross-Validation   1000 rounds
Adult           400     11409    0.166 (149)     0.169 (314)     0.165 (13)       0.166 (97)         0.172
Breast Cancer   341     342      0.050 (127)     0.047 (48)      0.041 (57)       0.047 (108)        0.058
Credit-A        326     327      0.187 (346)     0.199 (854)     0.156 (9)        0.174 (47)         0.199
Glass           107     107      0.252 (72)      0.196 (299)     0.346 (6)        0.290 (35)         0.196
Haberman        147     147      0.320 (27)      0.320 (45)      0.279 (1)        0.320 (38)         0.340
Heart           148     149      0.215 (124)     0.289 (950)     0.181 (31)       0.195 (14)         0.289
Ionosphere      175     176      0.085 (210)     0.120 (56)      0.142 (2)        0.114 (67)         0.085
Letter:AB       400     1155     0.005 (42)      0.014 (17)      0.061 (2)        0.005 (60)         0.010
Letter:DO       400     1158     0.041 (179)     0.041 (44)      0.143 (1)        0.044 (83)         0.043
Letter:OQ       400     1136     0.050 (65)      0.050 (138)     0.063 (26)       0.044 (118)        0.049
Liver           172     173      0.289 (541)     0.289 (743)     0.335 (5)        0.289 (603)        0.295
Mushroom        400     7724     0.010 (612)     0.024 (38)      0.079 (6)        0.024 (51)         0.010
Sonar           104     104      0.192 (688)     0.250 (20)      0.317 (2)        0.163 (34)         0.202
Tic-tac-toe     400     558      0.389 (59)      0.364 (2)       0.358 (5)        0.403 (9)          0.389
US votes        217     218      0.032 (11)      0.041 (598)     0.032 (16)       0.028 (1)          0.046
Waveform        400     7600     0.101 (145)     0.102 (178)     0.106 (13)       0.103 (22)         0.115
Wdbc            284     285      0.049 (40)      0.060 (19)      0.091 (2)        0.046 (10)         0.060

Statistical Comparison Tests

                        C_Q^S vs R_S(B_Q)   C_Q^S vs Validation Set   C_Q^S vs Cross-Validation   C_Q^S vs 1000 rounds
Poisson binomial test   91%                 86%                       57%                         90%
Sign test (p-value)     0.05                0.23                      0.60                        0.02

Table 1: Comparison of various stopping criteria over 1000 rounds of boosting. The Poisson
binomial test gives the probability that $\mathcal{C}_Q^S$ is a better stopping criterion than every
other approach. The sign test gives a p-value representing the probability that the
null hypothesis is true (i.e., the $\mathcal{C}_Q^S$ stopping criterion has the same performance
as every other approach).


and keep the majority vote with the lowest Bayes risk on the validation set. Note that this
last method differs from the others because AdaBoost sees 10% fewer examples during the
learning process, but this is the price to pay for using a validation set.

Table 1 compares the Bayes risks on the test set $R_T(B_Q)$ of the majority vote classifiers
selected by the different stopping criteria. We compute the probability that the C-bound is
a better stopping criterion than each of the other methods with two statistical tests: the Poisson
binomial test (Lacoste et al., 2012) and the sign test (Mendenhall, 1983). Both statistical
tests suggest that the empirical C-bound is a better model selection tool than the empirical
Bayes risk (as usual in machine learning tasks, this method is prone to overfitting) and
the validation set (although this method performs very well sometimes, it suffers from
the small quantity of training examples on several tasks). The empirical C-bound and the
cross-validation methods obtain a similar accuracy. However, the cross-validation procedure
needs more running time. We conclude that the empirical C-bound is a surprisingly good
stopping criterion for Boosting.
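The selection rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the margins $M_Q(x_i, y_i)$ of the majority vote on the training examples are assumed to be already available for each boosting round, the data below are made up, and the function names are hypothetical. A round whose first moment is not positive is given the trivial value 1, since the C-bound is only defined for a positive first moment.

```python
def empirical_c_bound(margins):
    """Second form of the C-bound, 1 - mu1^2 / mu2, computed on a list of margins."""
    m = len(margins)
    mu1 = sum(margins) / m
    mu2 = sum(v * v for v in margins) / m
    if mu1 <= 0:
        return 1.0  # C-bound undefined here; fall back to the trivial bound
    return 1 - mu1 ** 2 / mu2

def select_round(margins_per_round):
    """Index of the round with the lowest empirical C-bound (earliest on ties)."""
    scores = [empirical_c_bound(ms) for ms in margins_per_round]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical training margins after three boosting rounds (four examples each).
margins_per_round = [
    [0.2, -0.1, 0.3, 0.1],
    [0.5, 0.4, -0.2, 0.6],
    [0.9, 0.8, -0.9, 0.7],
]
best_round, scores = select_round(margins_per_round)
print(best_round, [round(s, 4) for s in scores])
```

Note that the third round has the largest average margin but also one confidently misclassified example, which inflates the second moment and therefore the C-bound.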
5. A PAC-Bayesian Story: From Zero to a PAC-Bayesian C-bound

In this section, we present a PAC-Bayesian theory that allows one to estimate the C-bound
value $\mathcal{C}_Q^D$ from its empirical estimate $\mathcal{C}_Q^S$. From there, we derive bounds on the risk of
the majority vote $R_D(B_Q)$ based on empirical observations. We first recall the classical
PAC-Bayesian bound (here called the PAC-Bound 0) that bounds the true Gibbs risk by its
empirical counterpart. We then present two different PAC-Bayesian bounds on the majority
vote classifier (respectively called PAC-Bounds 1 and 2). A third bound, PAC-Bound 3,
will be presented in Section 6. This analysis intends to be self-contained, and can act as an
introduction to PAC-Bayesian theory.$^8$

The first PAC-Bayesian theorem was proposed by McAllester (1999). Given a set of
voters $\mathcal{H}$, a prior distribution $P$ on $\mathcal{H}$ chosen before observing the data, and a posterior
distribution $Q$ on $\mathcal{H}$ chosen after observing a training set $S \sim D^m$ ($Q$ is typically chosen by
running a learning algorithm on $S$), PAC-Bayesian theorems give tight risk bounds for the
Gibbs classifier $G_Q$. These bounds on $R_D(G_Q)$ usually rely on two quantities:

a) The empirical Gibbs risk $R_S(G_Q)$, that is computed on the $m$ examples of $S$,

\[ R_S(G_Q) = \frac{1}{m}\sum_{i=1}^{m} \mathop{\mathbb{E}}_{f\sim Q} \mathcal{L}_\ell\big(f(x_i), y_i\big). \]

b) The Kullback-Leibler divergence between distributions $Q$ and $P$, that measures "how
far" the chosen posterior $Q$ is from the prior $P$,

\[ \mathrm{KL}(Q\|P) \overset{\text{def}}{=} \mathop{\mathbb{E}}_{f\sim Q} \ln\frac{Q(f)}{P(f)}. \tag{20} \]

Note that the obtained PAC-Bayesian bounds are uniformly valid for all possible
posteriors $Q$.
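For a finite set of voters, both quantities are simple weighted sums. The sketch below assumes the linear loss $\mathcal{L}_\ell(f(x), y) = \frac{1}{2}(1 - y \cdot f(x))$, which is consistent with the margin identities used later in this section; the voters, the sample and the distributions are hypothetical toy values.

```python
import math

def linear_loss(output, label):
    """Linear loss (1 - y * f(x)) / 2, assumed here for voters with outputs in [-1, 1]."""
    return (1 - label * output) / 2

def empirical_gibbs_risk(Q, voters, sample):
    """Q-weighted average linear loss of the voters over the sample S."""
    total = 0.0
    for x, y in sample:
        total += sum(q * linear_loss(f(x), y) for q, f in zip(Q, voters))
    return total / len(sample)

def kl_divergence(Q, P):
    """KL(Q||P) for distributions on the same finite set of voters."""
    return sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)

# Hypothetical toy setting: three threshold voters on the real line.
voters = [
    lambda x: 1 if x > 0 else -1,
    lambda x: 1 if x > 1 else -1,
    lambda x: -1,
]
sample = [(2.0, 1), (0.5, 1), (-1.0, -1), (1.5, -1)]
P = [1 / 3, 1 / 3, 1 / 3]      # prior, fixed before seeing the sample
Q = [0.5, 0.25, 0.25]          # posterior, chosen after seeing the sample

gibbs_risk = empirical_gibbs_risk(Q, voters, sample)
kl_qp = kl_divergence(Q, P)
print(gibbs_risk, round(kl_qp, 6))
```

Concentrating $Q$ on the best voter lowers the empirical Gibbs risk but increases $\mathrm{KL}(Q\|P)$, which is the trade-off that PAC-Bayesian bounds quantify.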

In the following, we present a very general PAC-Bayesian theorem (Section 5.1), and we
specialize it to obtain a bound on the Gibbs risk $R_D(G_Q)$ that is converted into a bound on
the risk of the majority vote $R_D(B_Q)$ by the factor 2 of Proposition 10 (Section 5.2). Then,
we define new losses that rely on a pair of voters (Section 5.3). These new losses allow
us to extend the PAC-Bayesian theory to directly bound $R_D(B_Q)$ through the C-bound
(Sections 5.4 and 5.5). For each proposed bound, we explain the algorithmic procedure
required to compute its value.

5.1 General PAC-Bayesian Theory for Real-Valued Losses 

A key step of most PAC-Bayesian proofs is summarized by the following Change of measure 
inequality (Lemma 17). 

We present here the same proof as in Seldin and Tishby (2010) and McAllester (2013). 
Note that the same result is derived from Fenchel’s inequality in Banerjee (2006) and 
Donsker-Varadhan’s variational formula for relative entropy in Seldin et al. (2012); Tol- 
stikhin and Seldin (2013). 

8. We also recommend the “practical prediction tutorial” of Langford (2005), that contains an insightful 
PAC-Bayesian introduction. 
Lemma 17 (Change of measure inequality) For any set $\mathcal{H}$, for any distributions $P$
and $Q$ on $\mathcal{H}$, and for any measurable function $\phi : \mathcal{H} \to \mathbb{R}$, we have

\[ \mathop{\mathbb{E}}_{f\sim Q} \phi(f) \le \mathrm{KL}(Q\|P) + \ln\Big(\mathop{\mathbb{E}}_{f\sim P} e^{\phi(f)}\Big). \]


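Proof The result is obtained by simple calculations, exploiting the definition of the
KL-divergence given by Equation (20), and then Jensen's inequality (Lemma 47, in Appendix A)
on concave function $\ln(\cdot)$:

\begin{align*}
\mathop{\mathbb{E}}_{f\sim Q} \phi(f) &= \mathop{\mathbb{E}}_{f\sim Q} \ln e^{\phi(f)} = \mathop{\mathbb{E}}_{f\sim Q} \ln\left(\frac{Q(f)}{P(f)}\cdot\frac{P(f)}{Q(f)}\cdot e^{\phi(f)}\right) \\
&= \mathrm{KL}(Q\|P) + \mathop{\mathbb{E}}_{f\sim Q} \ln\left(\frac{P(f)}{Q(f)}\cdot e^{\phi(f)}\right) \\
&\le \mathrm{KL}(Q\|P) + \ln\left(\mathop{\mathbb{E}}_{f\sim Q} \frac{P(f)}{Q(f)}\cdot e^{\phi(f)}\right) \qquad \text{(Jensen's inequality)} \\
&\le \mathrm{KL}(Q\|P) + \ln\left(\mathop{\mathbb{E}}_{f\sim P} e^{\phi(f)}\right).
\end{align*}

Note that the last inequality becomes an equality if $Q$ and $P$ share the same support. ■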
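The inequality of Lemma 17 is easy to check on a finite set of voters. In the sketch below, the prior, the function $\phi$ and the posterior are arbitrary hypothetical values. The second check uses the Gibbs posterior $Q^*(f) \propto P(f)\,e^{\phi(f)}$, for which both inequalities of the proof hold with equality; this is a standard fact that the text above does not state explicitly.

```python
import math

def kl_divergence(Q, P):
    """KL(Q||P) on a finite set; terms with Q(f) = 0 contribute nothing."""
    return sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)

def change_of_measure_sides(Q, P, phi):
    """Return (left-hand side, right-hand side) of the change of measure inequality."""
    lhs = sum(q * v for q, v in zip(Q, phi))
    rhs = kl_divergence(Q, P) + math.log(sum(p * math.exp(v) for p, v in zip(P, phi)))
    return lhs, rhs

# Hypothetical finite setting with four voters.
P = [0.25, 0.25, 0.25, 0.25]
phi = [0.3, -1.2, 2.0, 0.5]
Q = [0.1, 0.2, 0.6, 0.1]

lhs, rhs = change_of_measure_sides(Q, P, phi)

# Gibbs posterior Q*(f) proportional to P(f) * exp(phi(f)): the bound is attained.
weights = [p * math.exp(v) for p, v in zip(P, phi)]
Z = sum(weights)
Q_star = [w / Z for w in weights]
lhs_star, rhs_star = change_of_measure_sides(Q_star, P, phi)
print(round(lhs, 4), round(rhs, 4), round(lhs_star, 4), round(rhs_star, 4))
```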


Let us now present a general PAC-Bayesian theorem which bounds the expectation of
any real-valued loss function $\mathcal{L} : \mathcal{Y}\times\mathcal{Y} \to [0,1]$. This theorem is slightly more general than
the PAC-Bayesian theorem of Germain et al. (2009, Theorem 2.1), that is specialized to the
expected linear loss, and therefore gives rise to a bound of the "generalized" Gibbs risk of
Definition 5. A similar result is presented in Tolstikhin and Seldin (2013, Lemma 1).

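Theorem 18 (General PAC-Bayesian theorem for real-valued losses) For any distribution
$D$ on $\mathcal{X}\times\mathcal{Y}$, for any set $\mathcal{H}$ of voters $\mathcal{X}\to\mathcal{Y}$, for any loss $\mathcal{L} : \mathcal{Y}\times\mathcal{Y} \to [0,1]$, for
any prior distribution $P$ on $\mathcal{H}$, for any $\delta \in (0,1]$, for any $m' > 0$, and for any convex
function $\mathcal{D} : [0,1]\times[0,1] \to \mathbb{R}$, we have

\[ \Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q \text{ on } \mathcal{H}: \\ \mathcal{D}\Big(\mathop{\mathbb{E}}_{f\sim Q}\mathbb{E}_S^{\mathcal{L}}(f),\ \mathop{\mathbb{E}}_{f\sim Q}\mathbb{E}_D^{\mathcal{L}}(f)\Big) \le \dfrac{1}{m'}\left[\mathrm{KL}(Q\|P) + \ln\Big(\dfrac{1}{\delta}\mathop{\mathbb{E}}_{S\sim D^m}\mathop{\mathbb{E}}_{f\sim P} e^{m'\cdot\mathcal{D}(\mathbb{E}_S^{\mathcal{L}}(f),\ \mathbb{E}_D^{\mathcal{L}}(f))}\Big)\right] \end{array} \right) \ge 1-\delta, \]

where $\mathrm{KL}(Q\|P)$ is the Kullback-Leibler divergence between $Q$ and $P$ of Equation (20).

Most of the time, this theorem is used with $m' = m$, the size of the training set. However,
as pointed out by Lever et al. (2010), $m'$ does not have to be so. One can easily show
that different values of $m'$ affect the relative weighting between the terms $\mathrm{KL}(Q\|P)$ and
$\ln\big(\frac{1}{\delta}\mathbb{E}_{S\sim D^m}\mathbb{E}_{f\sim P}\, e^{m'\cdot\mathcal{D}(\mathbb{E}_S^{\mathcal{L}}(f),\ \mathbb{E}_D^{\mathcal{L}}(f))}\big)$ in the bound. Hence, especially in situations where
these two terms have very different values, a "good" choice for the value of $m'$ can tighten
the bound.

Proof Note that $\mathbb{E}_{f\sim P}\, e^{m'\cdot\mathcal{D}(\mathbb{E}_S^{\mathcal{L}}(f),\ \mathbb{E}_D^{\mathcal{L}}(f))}$ is a non-negative random variable. By Markov's
inequality (Lemma 46, in Appendix A), we have

\[ \Pr_{S\sim D^m}\left( \mathop{\mathbb{E}}_{f\sim P} e^{m'\cdot\mathcal{D}(\mathbb{E}_S^{\mathcal{L}}(f),\ \mathbb{E}_D^{\mathcal{L}}(f))} \le \frac{1}{\delta}\mathop{\mathbb{E}}_{S\sim D^m}\mathop{\mathbb{E}}_{f\sim P} e^{m'\cdot\mathcal{D}(\mathbb{E}_S^{\mathcal{L}}(f),\ \mathbb{E}_D^{\mathcal{L}}(f))} \right) \ge 1-\delta. \]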
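Hence, by taking the logarithm on each side of the innermost inequality, we obtain

\[ \Pr_{S\sim D^m}\left( \ln\Big[\mathop{\mathbb{E}}_{f\sim P} e^{m'\cdot\mathcal{D}(\mathbb{E}_S^{\mathcal{L}}(f),\ \mathbb{E}_D^{\mathcal{L}}(f))}\Big] \le \ln\Big[\frac{1}{\delta}\mathop{\mathbb{E}}_{S\sim D^m}\mathop{\mathbb{E}}_{f\sim P} e^{m'\cdot\mathcal{D}(\mathbb{E}_S^{\mathcal{L}}(f),\ \mathbb{E}_D^{\mathcal{L}}(f))}\Big] \right) \ge 1-\delta. \]

We apply the change of measure inequality (Lemma 17) on the left side of the innermost
inequality, with $\phi(f) = m'\cdot\mathcal{D}(\mathbb{E}_S^{\mathcal{L}}(f), \mathbb{E}_D^{\mathcal{L}}(f))$. We then use Jensen's inequality (Lemma 47,
in Appendix A), exploiting the convexity of $\mathcal{D}$:

\begin{align*}
\forall\, Q \text{ on } \mathcal{H}: \quad \ln\Big[\mathop{\mathbb{E}}_{f\sim P} e^{m'\cdot\mathcal{D}(\mathbb{E}_S^{\mathcal{L}}(f),\ \mathbb{E}_D^{\mathcal{L}}(f))}\Big] &\ge m'\cdot\mathop{\mathbb{E}}_{f\sim Q}\mathcal{D}\big(\mathbb{E}_S^{\mathcal{L}}(f),\ \mathbb{E}_D^{\mathcal{L}}(f)\big) - \mathrm{KL}(Q\|P) \\
&\ge m'\cdot\mathcal{D}\Big(\mathop{\mathbb{E}}_{f\sim Q}\mathbb{E}_S^{\mathcal{L}}(f),\ \mathop{\mathbb{E}}_{f\sim Q}\mathbb{E}_D^{\mathcal{L}}(f)\Big) - \mathrm{KL}(Q\|P).
\end{align*}

We therefore have

\[ \Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q: \\ m'\cdot\mathcal{D}\Big(\mathop{\mathbb{E}}_{f\sim Q}\mathbb{E}_S^{\mathcal{L}}(f),\ \mathop{\mathbb{E}}_{f\sim Q}\mathbb{E}_D^{\mathcal{L}}(f)\Big) - \mathrm{KL}(Q\|P) \le \ln\Big[\dfrac{1}{\delta}\mathop{\mathbb{E}}_{S\sim D^m}\mathop{\mathbb{E}}_{f\sim P} e^{m'\cdot\mathcal{D}(\mathbb{E}_S^{\mathcal{L}}(f),\ \mathbb{E}_D^{\mathcal{L}}(f))}\Big] \end{array} \right) \ge 1-\delta. \]

The result then follows from easy calculations. ■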

As shown in Germain et al. (2009), the general PAC-Bayesian theorem can be used to
recover many common variants of the PAC-Bayesian theorem, simply by selecting a
well-suited function $\mathcal{D}$. Among these, we obtain a similar bound as the one proposed by Langford
and Seeger (2001); Seeger (2002); Langford (2005) by using the Kullback-Leibler divergence
between the Bernoulli distributions with probability of success $q$ and probability of success $p$:

\[ \mathrm{kl}(q\|p) \overset{\text{def}}{=} q\ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}. \tag{21} \]

Note that $\mathrm{kl}(q\|p)$ is a shorthand notation for $\mathrm{KL}(Q\|P)$ of Equation (20), with $Q = (q, 1-q)$
and $P = (p, 1-p)$. Corollary 50 (in Appendix A) shows that $\mathrm{kl}(q\|p)$ is a convex function.

In order to apply Theorem 18 with $\mathcal{D}(q,p) = \mathrm{kl}(q\|p)$ and $m' = m$, we need the next lemma.


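Lemma 19 For any distribution $D$ on $\mathcal{X}\times\mathcal{Y}$, for any voter $f : \mathcal{X}\to\mathcal{Y}$, for any loss
$\mathcal{L} : \mathcal{Y}\times\mathcal{Y} \to [0,1]$, and any positive integer $m$, we have

\[ \mathop{\mathbb{E}}_{S\sim D^m} \exp\Big[m\cdot\mathrm{kl}\big(\mathbb{E}_S^{\mathcal{L}}(f)\,\big\|\,\mathbb{E}_D^{\mathcal{L}}(f)\big)\Big] \le \xi(m), \]

where

\[ \xi(m) \overset{\text{def}}{=} \sum_{k=0}^{m}\binom{m}{k}\left(\frac{k}{m}\right)^{k}\left(1-\frac{k}{m}\right)^{m-k}. \tag{22} \]

Moreover, $\sqrt{m} \le \xi(m) \le 2\sqrt{m}$.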


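Proof Let us introduce a random variable $X_f$ that follows a binomial distribution of $m$
trials with a probability of success $\mathbb{E}_D^{\mathcal{L}}(f)$. Hence, $X_f \sim B\big(m, \mathbb{E}_D^{\mathcal{L}}(f)\big)$.

As $e^{m\cdot\mathrm{kl}(\,\cdot\,\|\,\mathbb{E}_D^{\mathcal{L}}(f))}$ is a convex function, Lemma 51 (due to Maurer, 2004, and provided
in Appendix A), shows that

\[ \mathop{\mathbb{E}}_{S\sim D^m} \exp\Big[m\cdot\mathrm{kl}\big(\mathbb{E}_S^{\mathcal{L}}(f)\,\big\|\,\mathbb{E}_D^{\mathcal{L}}(f)\big)\Big] \le \mathop{\mathbb{E}}_{X_f\sim B(m,\,\mathbb{E}_D^{\mathcal{L}}(f))} \exp\Big[m\cdot\mathrm{kl}\big(\tfrac{1}{m}X_f\,\big\|\,\mathbb{E}_D^{\mathcal{L}}(f)\big)\Big]. \]

We then have

\begin{align*}
\mathop{\mathbb{E}}_{X_f\sim B(m,\,\mathbb{E}_D^{\mathcal{L}}(f))} e^{m\cdot\mathrm{kl}(\frac{1}{m}X_f\,\|\,\mathbb{E}_D^{\mathcal{L}}(f))}
&= \mathop{\mathbb{E}}_{X_f\sim B(m,\,\mathbb{E}_D^{\mathcal{L}}(f))} \left(\frac{\frac{1}{m}X_f}{\mathbb{E}_D^{\mathcal{L}}(f)}\right)^{X_f}\left(\frac{1-\frac{1}{m}X_f}{1-\mathbb{E}_D^{\mathcal{L}}(f)}\right)^{m-X_f} \\
&= \sum_{k=0}^{m} \Pr(X_f = k)\left(\frac{\frac{k}{m}}{\mathbb{E}_D^{\mathcal{L}}(f)}\right)^{k}\left(\frac{1-\frac{k}{m}}{1-\mathbb{E}_D^{\mathcal{L}}(f)}\right)^{m-k} \\
&= \sum_{k=0}^{m} \binom{m}{k}\big(\mathbb{E}_D^{\mathcal{L}}(f)\big)^{k}\big(1-\mathbb{E}_D^{\mathcal{L}}(f)\big)^{m-k}\left(\frac{\frac{k}{m}}{\mathbb{E}_D^{\mathcal{L}}(f)}\right)^{k}\left(\frac{1-\frac{k}{m}}{1-\mathbb{E}_D^{\mathcal{L}}(f)}\right)^{m-k} \\
&= \sum_{k=0}^{m} \binom{m}{k}\left(\frac{k}{m}\right)^{k}\left(1-\frac{k}{m}\right)^{m-k} = \xi(m).
\end{align*}

Maurer (2004) shows that $\xi(m) \le 2\sqrt{m}$ for $m \ge 8$, and $\xi(m) \ge \sqrt{m}$ for $m \ge 2$. However,
the cases for $m \in \{1, 2, 3, 4, 5, 6, 7\}$ are easy to verify computationally. ■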
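The computational verification mentioned at the end of the proof can be sketched as follows. The function name `xi` is hypothetical; the sum is the one that closes the derivation above, evaluated in log-space (through `math.lgamma`) so that large values of $m$ do not overflow, and with the convention $0^0 = 1$ for the terms $k = 0$ and $k = m$.

```python
import math

def xi(m):
    """Sum over k of C(m, k) * (k/m)^k * (1 - k/m)^(m - k), with 0^0 taken as 1."""
    total = 0.0
    for k in range(m + 1):
        # log of the binomial coefficient C(m, k)
        log_term = math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
        if k > 0:
            log_term += k * math.log(k / m)
        if k < m:
            log_term += (m - k) * math.log(1 - k / m)
        total += math.exp(log_term)
    return total

for m in range(1, 8):
    print(m, round(math.sqrt(m), 4), round(xi(m), 4), round(2 * math.sqrt(m), 4))
```

For $m = 1$ the upper bound is attained, since $\xi(1) = 2 = 2\sqrt{1}$.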

Theorem 20 below specializes the general PAC-Bayesian theorem to $\mathcal{D}(q,p) = \mathrm{kl}(q\|p)$, but
still applies to any real-valued loss functions. This theorem can be seen as an intermediate
step to obtain Corollary 21 of the next section, which uses the linear loss to bound the Gibbs
risk. However, Theorem 20 below is reused afterwards in Section 5.3 to derive PAC-Bayesian
theorems for other loss functions.

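Theorem 20 For any distribution $D$ on $\mathcal{X}\times\mathcal{Y}$, for any set $\mathcal{H}$ of voters $\mathcal{X}\to\mathcal{Y}$, for any
loss $\mathcal{L} : \mathcal{Y}\times\mathcal{Y} \to [0,1]$, for any prior distribution $P$ on $\mathcal{H}$, for any $\delta \in (0,1]$, we have

\[ \Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q \text{ on } \mathcal{H}: \\ \mathrm{kl}\Big(\mathop{\mathbb{E}}_{f\sim Q}\mathbb{E}_S^{\mathcal{L}}(f)\,\Big\|\,\mathop{\mathbb{E}}_{f\sim Q}\mathbb{E}_D^{\mathcal{L}}(f)\Big) \le \dfrac{1}{m}\left[\mathrm{KL}(Q\|P) + \ln\dfrac{\xi(m)}{\delta}\right] \end{array} \right) \ge 1-\delta. \]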


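Proof By Theorem 18, with $\mathcal{D}(q,p) = \mathrm{kl}(q\|p)$ and $m' = m$, we have

\[ \Pr_{S\sim D^m}\left( \begin{array}{l} \forall\, Q \text{ on } \mathcal{H}: \\ \mathrm{kl}\Big(\mathop{\mathbb{E}}_{f\sim Q}\mathbb{E}_S^{\mathcal{L}}(f)\,\Big\|\,\mathop{\mathbb{E}}_{f\sim Q}\mathbb{E}_D^{\mathcal{L}}(f)\Big) \le \dfrac{1}{m}\left[\mathrm{KL}(Q\|P) + \ln\Big(\dfrac{1}{\delta}\mathop{\mathbb{E}}_{S\sim D^m}\mathop{\mathbb{E}}_{f\sim P} e^{m\cdot\mathrm{kl}(\mathbb{E}_S^{\mathcal{L}}(f)\,\|\,\mathbb{E}_D^{\mathcal{L}}(f))}\Big)\right] \end{array} \right) \ge 1-\delta. \]

As the prior $P$ is independent of $S$, we can swap the two expectations in $\mathbb{E}_{S\sim D^m}\mathbb{E}_{f\sim P}\, e^{m\cdot\mathrm{kl}(\cdot\|\cdot)}$.
This observation, together with Lemma 19, gives

\[ \mathop{\mathbb{E}}_{S\sim D^m}\mathop{\mathbb{E}}_{f\sim P} e^{m\cdot\mathrm{kl}(\mathbb{E}_S^{\mathcal{L}}(f)\,\|\,\mathbb{E}_D^{\mathcal{L}}(f))} = \mathop{\mathbb{E}}_{f\sim P}\mathop{\mathbb{E}}_{S\sim D^m} e^{m\cdot\mathrm{kl}(\mathbb{E}_S^{\mathcal{L}}(f)\,\|\,\mathbb{E}_D^{\mathcal{L}}(f))} \le \mathop{\mathbb{E}}_{f\sim P}\xi(m) = \xi(m). \]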
5.2 PAC-Bayesian Theory for the Gibbs Classifier

This section presents two classical PAC-Bayesian results that bound the risk of the Gibbs 
classifier. One of these bounds is used to express a first PAC-Bayesian bound on the risk of 
the majority vote classifier. Then, we explain how to compute the empirical value of this 
bound by a root-finding method. 

5.2.1 PAC-Bayesian Theorems for the Gibbs Risk 

We interpret the two following results as straightforward corollaries of Theorem 20. Indeed,
from Definition 5, the expected linear loss of a Gibbs classifier $G_Q$ on a distribution $D'$ is
$R_{D'}(G_Q)$. These two corollaries are very similar to well-known PAC-Bayesian theorems.
First, Corollary 21 is similar to the PAC-Bayesian theorem of Langford and Seeger (2001);
Seeger (2002); Langford (2005), with the exception that $\ln\frac{m+1}{\delta}$ is replaced by $\ln\frac{\xi(m)}{\delta}$. Since
$\xi(m) \le 2\sqrt{m} \le m+1$, this result gives slightly better bounds. Similarly, Corollary 22
provides a slight improvement of the PAC-Bayesian bound of McAllester (1999, 2003a).

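Corollary 21 (Langford and Seeger, 2001; Seeger, 2002; Langford, 2005) For any distribution
$D$ on $\mathcal{X}\times\{-1,1\}$, for any set $\mathcal{H}$ of voters $\mathcal{X}\to[-1,1]$, for any prior distribution $P$
on $\mathcal{H}$, and any $\delta \in (0,1]$, we have

\[ \Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q \text{ on } \mathcal{H}: \\ \mathrm{kl}\big(R_S(G_Q)\,\big\|\,R_D(G_Q)\big) \le \dfrac{1}{m}\left[\mathrm{KL}(Q\|P) + \ln\dfrac{\xi(m)}{\delta}\right] \end{array} \right) \ge 1-\delta. \]

Proof The result is directly obtained from Theorem 20 using the linear loss $\mathcal{L} = \mathcal{L}_\ell$ to
recover the Gibbs risk of Definition 5. ■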


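Corollary 22 (McAllester, 1999, 2003a) For any distribution $D$ on $\mathcal{X}\times\{-1,1\}$, for any
set $\mathcal{H}$ of voters $\mathcal{X}\to[-1,1]$, for any prior distribution $P$ on $\mathcal{H}$, and any $\delta \in (0,1]$, we have

\[ \Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q \text{ on } \mathcal{H}: \\ R_D(G_Q) \le R_S(G_Q) + \sqrt{\dfrac{1}{2m}\left[\mathrm{KL}(Q\|P) + \ln\dfrac{\xi(m)}{\delta}\right]} \end{array} \right) \ge 1-\delta. \]

Proof The result is obtained from Corollary 21 together with Pinsker's inequality

\[ 2\,(q-p)^2 \le \mathrm{kl}(q\|p). \]

We then have

\[ \Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q \text{ on } \mathcal{H}: \\ 2\cdot\big(R_S(G_Q) - R_D(G_Q)\big)^2 \le \dfrac{1}{m}\left[\mathrm{KL}(Q\|P) + \ln\dfrac{\xi(m)}{\delta}\right] \end{array} \right) \ge 1-\delta. \]

The result is obtained by isolating $R_D(G_Q)$ in the inequality, omitting the lower bound of
$R_D(G_Q)$. Recall that the probability is "$\ge 1-\delta$", hence if we omit an event, the probability
may just increase, continuing to be greater than $1-\delta$. ■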
5.2.2 A First Bound for the Risk of the Majority Vote

Let us assume that the Gibbs risk $R_D(G_Q)$ of a classifier is lower than or equal to $\frac{1}{2}$. Given
an empirical Gibbs risk $R_S(G_Q)$ computed on a training set of $m$ examples, the
Kullback-Leibler divergence $\mathrm{KL}(Q\|P)$, and a confidence parameter $\delta$, Corollary 21 says that the
Gibbs risk $R_D(G_Q)$ is included (with confidence $1-\delta$) in the continuous set $\mathcal{R}^{\delta}_{Q,S}$ defined as

\[ \mathcal{R}^{\delta}_{Q,S} \overset{\text{def}}{=} \left\{ r \;:\; \mathrm{kl}\big(R_S(G_Q)\,\big\|\,r\big) \le \frac{1}{m}\left[\mathrm{KL}(Q\|P) + \ln\frac{\xi(m)}{\delta}\right] \ \text{and}\ r \le \frac{1}{2} \right\}. \tag{23} \]

Thus, an upper bound on $R_D(G_Q)$ is obtained by seeking the maximum value of $\mathcal{R}^{\delta}_{Q,S}$. As
explained by Proposition 10, we need to multiply the obtained value by a factor 2 to have
an upper bound on $R_D(B_Q)$. This methodology is summarized by PAC-Bound 0.

Note that PAC-Bound 0 is also valid when $R_D(G_Q)$ is greater than $\frac{1}{2}$, because in this
case, $2\cdot\sup\mathcal{R}^{\delta}_{Q,S} = 1$ (with confidence at least $1-\delta$), which is a trivial upper bound
of $R_D(B_Q)$.


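PAC-Bound 0 For any distribution $D$ on $\mathcal{X}\times\{-1,1\}$, for any set $\mathcal{H}$ of voters $\mathcal{X}\to[-1,1]$,
for any prior distribution $P$ on $\mathcal{H}$, and any $\delta \in (0,1]$, we have

\[ \Pr_{S\sim D^m}\Big( \forall\, Q \text{ on } \mathcal{H}: \; R_D(B_Q) \le 2\cdot\sup\mathcal{R}^{\delta}_{Q,S} \Big) \ge 1-\delta. \]

Proof If $\sup\mathcal{R}^{\delta}_{Q,S} = \frac{1}{2}$, the bound is trivially valid because $R_D(B_Q) \le 1$. Otherwise, the
bound is a direct consequence of Proposition 10 and Corollary 21. ■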

As we see, the proposed bound cannot be obtained by a closed-form expression. Thus, we
need to use a strategy such as the one suggested in the following.

5.2.3 Computation of PAC-Bound 0 

One can compute the value $r = \sup\mathcal{R}^{\delta}_{Q,S}$ of PAC-Bound 0 by solving

\[ \mathrm{kl}\big(R_S(G_Q)\,\big\|\,r\big) = \frac{1}{m}\left[\mathrm{KL}(Q\|P) + \ln\frac{\xi(m)}{\delta}\right], \quad \text{with } R_S(G_Q) \le r \le \frac{1}{2}, \]

by a root-finding method. This turns out to be an easy task since the left-hand side of
the equality is a convex function of $r$ and the right-hand side is a constant value. Note
that solving the same equation with the constraint $r \le R_S(G_Q)$ gives a lower bound of
$R_D(G_Q)$, but not a lower bound on $R_D(B_Q)$. Figure 4 shows an application example of
PAC-Bound 0.
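One possible root-finding strategy is plain bisection, which suffices because $\mathrm{kl}(q\|r)$ is monotone in $r$ on each side of $q$. The sketch below is an illustration and not the authors' released code: the function names are hypothetical, $\xi(m)$ is computed from the sum of Equation (22), and the numerical inputs are those quoted in the caption of Figure 4.

```python
import math

def xi(m):
    """Sum over k of C(m, k) * (k/m)^k * (1 - k/m)^(m - k), evaluated in log-space."""
    total = 0.0
    for k in range(m + 1):
        log_term = math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
        if k > 0:
            log_term += k * math.log(k / m)
        if k < m:
            log_term += (m - k) * math.log(1 - k / m)
        total += math.exp(log_term)
    return total

def kl_bernoulli(q, p):
    """kl(q||p) between Bernoulli distributions; p is clamped away from 0 and 1."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    value = 0.0
    if q > 0:
        value += q * math.log(q / p)
    if q < 1:
        value += (1 - q) * math.log((1 - q) / (1 - p))
    return value

def kl_inverse(q, rhs, upper):
    """Largest r in [q, 1/2] (upper=True) or smallest r in [0, q] (upper=False)
    with kl(q||r) <= rhs, by bisection. Assumes q <= 1/2."""
    inside, outside = q, (0.5 if upper else 0.0)
    if kl_bernoulli(q, outside) <= rhs:
        return outside
    for _ in range(100):
        mid = (inside + outside) / 2
        if kl_bernoulli(q, mid) <= rhs:
            inside = mid
        else:
            outside = mid
    return inside

def pac_bound_0(emp_gibbs_risk, kl_qp, m, delta):
    """Return (lower end, upper end) of the interval and the bound 2 * sup on R_D(B_Q)."""
    rhs = (kl_qp + math.log(xi(m) / delta)) / m
    upper = 0.5 if emp_gibbs_risk >= 0.5 else kl_inverse(emp_gibbs_risk, rhs, upper=True)
    lower = kl_inverse(min(emp_gibbs_risk, 0.5), rhs, upper=False)
    return lower, upper, 2 * upper

# Values quoted in the caption of Figure 4.
lower, upper, bound = pac_bound_0(emp_gibbs_risk=0.30, kl_qp=5.0, m=1000, delta=0.05)
print(round(lower, 3), round(upper, 3), round(bound, 3))
```

Any bracketing root finder would do equally well; bisection is used here only because it needs no derivative and cannot leave the interval.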

5.3 Joint Error, Joint Success, and Paired-voters 

We now introduce a few notions that are necessary to obtain new PAC-Bayesian theorems 
for the C-bound in Sections 5.4 and 5.5. 
Figure 4: Example of application of PAC-Bound 0. We suppose that $\mathrm{KL}(Q\|P) = 5$, $m = 1000$
and $\delta = 0.05$. If we observe an empirical Gibbs risk $R_S(G_Q) = 0.30$, then
$R_D(G_Q) \in \mathcal{R}^{\delta}_{Q,S} \approx [0.233, 0.373]$ with a confidence of 95%. On the figure, the
intersections between the two curves correspond to the limits of the interval $\mathcal{R}^{\delta}_{Q,S}$.
Then, with these values, PAC-Bound 0 gives $R_D(B_Q) \le 2\cdot 0.373 = 0.746$.


5.3.1 The Joint Error and the Joint Success 

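We have already defined the expected disagreement $d_Q^{D'}$ of a distribution $Q$ of voters
(Definition 7). In the case of binary voters, the expected disagreement corresponds to

\[ d_Q^{D'} = \mathop{\mathbb{E}}_{h_1\sim Q}\mathop{\mathbb{E}}_{h_2\sim Q}\Big(\mathop{\mathbb{E}}_{(x,y)\sim D'} I\big(h_1(x) \ne h_2(x)\big)\Big). \]

Let us now define two closely related notions, the expected joint success $s_Q^{D'}$ and the expected
joint error $e_Q^{D'}$. In the case of binary voters, these two concepts are expressed naturally by

\begin{align*}
e_Q^{D'} &= \mathop{\mathbb{E}}_{h_1\sim Q}\mathop{\mathbb{E}}_{h_2\sim Q}\Big(\mathop{\mathbb{E}}_{(x,y)\sim D'} I\big(h_1(x) \ne y\big)\, I\big(h_2(x) \ne y\big)\Big), \\
s_Q^{D'} &= \mathop{\mathbb{E}}_{h_1\sim Q}\mathop{\mathbb{E}}_{h_2\sim Q}\Big(\mathop{\mathbb{E}}_{(x,y)\sim D'} I\big(h_1(x) = y\big)\, I\big(h_2(x) = y\big)\Big).
\end{align*}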

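Let us now extend in the usual way these equations to the case of real-valued voters.

Definition 23 For any probability distribution $Q$ on a set of voters, we define the expected
joint error $e_Q^{D'}$ relative to $D'$ and the expected joint success $s_Q^{D'}$ relative to $D'$ as

\begin{align*}
e_Q^{D'} &\overset{\text{def}}{=} \mathop{\mathbb{E}}_{f_1\sim Q}\mathop{\mathbb{E}}_{f_2\sim Q}\Big(\mathop{\mathbb{E}}_{(x,y)\sim D'} \mathcal{L}_\ell\big(f_1(x), y\big)\cdot\mathcal{L}_\ell\big(f_2(x), y\big)\Big), \\
s_Q^{D'} &\overset{\text{def}}{=} \mathop{\mathbb{E}}_{f_1\sim Q}\mathop{\mathbb{E}}_{f_2\sim Q}\Big(\mathop{\mathbb{E}}_{(x,y)\sim D'} \big[1 - \mathcal{L}_\ell\big(f_1(x), y\big)\big]\cdot\big[1 - \mathcal{L}_\ell\big(f_2(x), y\big)\big]\Big).
\end{align*}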
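From the definitions of the linear loss (Definition 2) and the margin (Definition 8), we
can easily see that

\begin{align*}
e_Q^{D'} &= \mathop{\mathbb{E}}_{(x,y)\sim D'}\left(\frac{1 - M_Q(x,y)}{2}\right)^2 = \frac{1}{4}\Big(1 - 2\cdot\mu_1(M_Q^{D'}) + \mu_2(M_Q^{D'})\Big), \\
s_Q^{D'} &= \mathop{\mathbb{E}}_{(x,y)\sim D'}\left(\frac{1 + M_Q(x,y)}{2}\right)^2 = \frac{1}{4}\Big(1 + 2\cdot\mu_1(M_Q^{D'}) + \mu_2(M_Q^{D'})\Big).
\end{align*}

Remembering from Equation (9) that $d_Q^{D'} = \frac{1}{2}\big(1 - \mu_2(M_Q^{D'})\big)$, we can conclude that $e_Q^{D'}$,
$s_Q^{D'}$ and $d_Q^{D'}$ always sum to one:$^9$

\[ e_Q^{D'} + s_Q^{D'} + d_Q^{D'} = 1. \]

We can now rewrite the first moment of the margin and the Gibbs risk as

\begin{align*}
\mu_1(M_Q^{D'}) &= s_Q^{D'} - e_Q^{D'} = 1 - \big(2\,e_Q^{D'} + d_Q^{D'}\big), \\
R_{D'}(G_Q) &= \tfrac{1}{2}\big(1 - s_Q^{D'} + e_Q^{D'}\big) = \tfrac{1}{2}\big(2\,e_Q^{D'} + d_Q^{D'}\big). \tag{24}
\end{align*}

Therefore, the third form of the C-bound of Theorem 11 can be rewritten as

\[ \mathcal{C}_Q^{D'} = 1 - \frac{\Big(1 - \big(2\,e_Q^{D'} + d_Q^{D'}\big)\Big)^2}{1 - 2\,d_Q^{D'}}. \tag{25} \]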

5.3.2 Paired-Voters and Their Losses 

This first generalization of the PAC-Bayesian theorem allows us to bound separately either
$d_Q^{D'}$, $e_Q^{D'}$ or $s_Q^{D'}$, and therefore to bound $\mathcal{C}_Q^{D'}$. To prove this result, we need to define a new
kind of voter that we call a paired-voter.


Definition 24 Given two voters $f_i : \mathcal{X}\to[-1,1]$ and $f_j : \mathcal{X}\to[-1,1]$, the paired-voter
$f_{ij} : \mathcal{X}\to[-1,1]^2$ outputs a tuple:

\[ f_{ij}(x) \overset{\text{def}}{=} \langle f_i(x), f_j(x) \rangle. \]

Given a set of voters $\mathcal{H}$ weighted by a distribution $Q$ on $\mathcal{H}$, we define a set of paired-voters $\mathcal{H}^2$
weighted by a distribution $Q^2$ as

\[ \mathcal{H}^2 \overset{\text{def}}{=} \{ f_{ij} : f_i, f_j \in \mathcal{H} \}, \quad \text{and} \quad Q^2(f_{ij}) \overset{\text{def}}{=} Q(f_i)\cdot Q(f_j). \tag{26} \]

We now present three losses for paired-voters. Remember that a loss function has the
form $\mathcal{Y}\times\mathcal{Y}\to[0,1]$, where $\mathcal{Y}$ is the voter's output space. As a paired-voter output is a
9. This is fairly intuitive in the case of binary voters. Indeed, given any example $(x, y)$ and any two binary
voters $h_1, h_2$, we have either: both voters misclassify the example (i.e., $h_1(x) = h_2(x) \ne y$), both voters
correctly classify the example (i.e., $h_1(x) = h_2(x) = y$), or both voters disagree (i.e., $h_1(x) \ne h_2(x)$).
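tuple, our new loss functions map $[-1,1]^2\times\{-1,1\}$ to $[0,1]$. Thus,

\begin{align*}
\mathcal{L}_e\big(f_{ij}(x), y\big) &\overset{\text{def}}{=} \mathcal{L}_\ell\big(f_i(x), y\big)\cdot\mathcal{L}_\ell\big(f_j(x), y\big), \\
\mathcal{L}_s\big(f_{ij}(x), y\big) &\overset{\text{def}}{=} \big[1 - \mathcal{L}_\ell\big(f_i(x), y\big)\big]\cdot\big[1 - \mathcal{L}_\ell\big(f_j(x), y\big)\big], \tag{27} \\
\mathcal{L}_d\big(f_{ij}(x), y\big) &\overset{\text{def}}{=} \mathcal{L}_\ell\big(f_i(x)\cdot f_j(x), 1\big).
\end{align*}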


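The key observation to understand the next theorems is that the expected losses of
paired-voters $\mathcal{H}^2$ defined by Equation (26) allow one to recover the values of $e_Q^{D'}$, $s_Q^{D'}$ and
$d_Q^{D'}$. Indeed, it directly follows from Definitions 3, 7 and 23, that

\[ e_Q^{D'} = \mathop{\mathbb{E}}_{f_{ij}\sim Q^2}\mathbb{E}_{D'}^{\mathcal{L}_e}(f_{ij})\,; \qquad s_Q^{D'} = \mathop{\mathbb{E}}_{f_{ij}\sim Q^2}\mathbb{E}_{D'}^{\mathcal{L}_s}(f_{ij})\,; \qquad d_Q^{D'} = \mathop{\mathbb{E}}_{f_{ij}\sim Q^2}\mathbb{E}_{D'}^{\mathcal{L}_d}(f_{ij}). \tag{28} \]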

5.4 PAC-Bayesian Theory For Losses of Paired-voters 

As explained in Section 5.2, classical PAC-Bayesian theorems, like Corollaries 21 and 22,
provide an upper bound on $R_D(G_Q)$ that holds uniformly for all posteriors $Q$. A bound on
$R_D(B_Q)$ is typically obtained by multiplying the former bound by the usual factor of 2, as
in PAC-Bound 0.

In this subsection, we present a first bound of $R_D(B_Q)$ relying on the C-bound of
Theorem 11. A uniform bound on $\mathcal{C}_Q^D$ is obtained using the third form of the C-bound, through a
bound on the Gibbs risk $R_D(G_Q)$ and another bound on the disagreement $d_Q^D$. The desired
bound on $R_D(G_Q)$ is obtained by Corollary 21 as in PAC-Bound 0. To obtain a bound
on $d_Q^D$, we capitalize on the notion of paired-voters presented in the previous section. This
allows us to express two new PAC-Bayesian bounds on the risk of a majority vote, one for
the supervised case and another for the semi-supervised case.


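5.4.1 A PAC-Bayesian Theorem for $e_Q^D$, $s_Q^D$, or $d_Q^D$

The following PAC-Bayesian theorem can either bound the expected disagreement $d_Q^D$, the
expected joint success $s_Q^D$ or the expected joint error $e_Q^D$ of a majority vote (see Definitions 7
and 23).


Theorem 25 For any distribution $D$ on $\mathcal{X}\times\{-1,1\}$, for any set $\mathcal{H}$ of voters $\mathcal{X}\to[-1,1]$,
for any prior distribution $P$ on $\mathcal{H}$, and any $\delta \in (0,1]$, we have

\[ \Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q \text{ on } \mathcal{H}: \\ \mathrm{kl}\big(\alpha_Q^S\,\big\|\,\alpha_Q^D\big) \le \dfrac{1}{m}\left[2\cdot\mathrm{KL}(Q\|P) + \ln\dfrac{\xi(m)}{\delta}\right] \end{array} \right) \ge 1-\delta, \]

where $\alpha_Q^{D'}$ can be either $e_Q^{D'}$, $s_Q^{D'}$ or $d_Q^{D'}$.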


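Proof Theorem 25 is deduced from Theorem 20. We present here the proof for $\alpha_Q^{D'} = e_Q^{D'}$.
The two other cases are very similar.

Consider the set of paired-voters $\mathcal{H}^2$ and the posterior distribution $Q^2$ of Equation (26).
Also consider the prior distribution $P^2$ on $\mathcal{H}^2$ such that $P^2(f_{ij}) \overset{\text{def}}{=} P(f_i)\cdot P(f_j)$. Then
we have,

\begin{align*}
\mathrm{KL}(Q^2\|P^2) &= \mathop{\mathbb{E}}_{f_{ij}\sim Q^2} \ln\frac{Q^2(f_{ij})}{P^2(f_{ij})} = \mathop{\mathbb{E}}_{f_{ij}\sim Q^2} \ln\frac{Q(f_i)\cdot Q(f_j)}{P(f_i)\cdot P(f_j)} \\
&= \mathop{\mathbb{E}}_{f_{ij}\sim Q^2} \ln\frac{Q(f_i)}{P(f_i)} + \mathop{\mathbb{E}}_{f_{ij}\sim Q^2} \ln\frac{Q(f_j)}{P(f_j)} \\
&= 2\cdot\mathrm{KL}(Q\|P).
\end{align*}

Finally, from Equation (28), we have $\mathbb{E}_{f_{ij}\sim Q^2}\mathbb{E}_S^{\mathcal{L}_e}(f_{ij}) = e_Q^S$ and $\mathbb{E}_{f_{ij}\sim Q^2}\mathbb{E}_D^{\mathcal{L}_e}(f_{ij}) = e_Q^D$.
Hence, by applying Theorem 20, we are done. ■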

5.4.2 A New Bound for the Risk of the Majority Vote 

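Based on the fact that Theorem 25 gives a lower bound on the expected disagreement $d_Q^D$, we
now derive PAC-Bound 1, which is a PAC-Bayesian bound for the C-bound, and therefore,
for the risk of the majority vote.

Given any prior distribution $P$ on $\mathcal{H}$, we need the interval $\mathcal{R}^{\delta}_{Q,S}$ of Equation (23),
together with

\[ \mathcal{D}^{\delta}_{Q,S} \overset{\text{def}}{=} \left\{ d \;:\; \mathrm{kl}\big(d_Q^S\,\big\|\,d\big) \le \frac{1}{m}\left[2\cdot\mathrm{KL}(Q\|P) + \ln\frac{\xi(m)}{\delta}\right] \right\}. \tag{29} \]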


We then express the following bound on the Bayes risk. 


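PAC-Bound 1 For any distribution $D$ on $\mathcal{X}\times\{-1,1\}$, for any set $\mathcal{H}$ of voters $\mathcal{X}\to[-1,1]$,
for any prior distribution $P$ on $\mathcal{H}$, and any $\delta\in(0,1]$, we have

$$\Pr_{S\sim D^m}\left(\forall\, Q \text{ on } \mathcal{H}:\ R_D(B_Q) \le 1 - \frac{\left(1 - 2\cdot\sup \mathcal{R}^{\delta/2}_{Q,S}\right)^2}{1 - 2\cdot\inf \mathcal{D}^{\delta/2}_{Q,S}}\right) \ge 1-\delta,$$

where $\mathcal{R}^{\delta/2}_{Q,S}$ and $\mathcal{D}^{\delta/2}_{Q,S}$ are respectively defined by Equations (23) and (29).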

Proof By Proposition 9, we have that $d_Q^D \le \frac{1}{2}$. This, together with the facts that $m$ is
finite and $d_Q^D \in \mathcal{D}^{\delta/2}_{Q,S}$, implies that $\inf \mathcal{D}^{\delta/2}_{Q,S} < \frac{1}{2}$, and therefore that the denominator of the
fraction in the statement of PAC-Bound 1 is always strictly positive.

Necessarily, $\sup \mathcal{R}^{\delta/2}_{Q,S} \le \frac{1}{2}$. Let us consider the two following cases.

Case 1: $\sup \mathcal{R}^{\delta/2}_{Q,S} = \frac{1}{2}$. Then, $1 - 2\cdot\sup \mathcal{R}^{\delta/2}_{Q,S} = 0$, and the bound on $R_D(B_Q)$ is 1, which
is trivially valid.

Case 2: $\sup \mathcal{R}^{\delta/2}_{Q,S} < \frac{1}{2}$. Then, we can apply the third form of Theorem 11 to obtain the upper
bound on $R_D(B_Q)$. The desired bound is obtained by replacing $d_Q^D$ by its lower bound
$\inf \mathcal{D}^{\delta/2}_{Q,S}$, and $R_D(G_Q)$ by its upper bound $\sup \mathcal{R}^{\delta/2}_{Q,S}$. The two bounds can therefore be
deduced by suitably applying Corollary 21 (replacing $\delta$ by $\delta/2$) and Theorem 25 (replacing
$\alpha_Q^S$ by $d_Q^S$, $\alpha_Q^D$ by $d_Q^D$ and $\delta$ by $\delta/2$). ■


This bound has a major drawback: it degrades rapidly if the bounds on the numerator
and the denominator are not tight. Note however that in the semi-supervised
framework, we can achieve tighter results because the labels of the examples do not affect
the value of $d_Q^D$ (see Definition 7). Indeed, it is generally assumed in this framework that
the learner has access to a huge amount $m'$ of unlabeled data (i.e., $m' \gg m$). One can then
obtain a tighter bound of the disagreement. In this context, PAC-Bound 1’ stated below is
tighter than PAC-Bound 1.


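PAC-Bound 1’ (Semi-supervised bound) For any distribution $D$ on $\mathcal{X}\times\{-1,1\}$, for
any set $\mathcal{H}$ of voters $\mathcal{X}\to[-1,1]$, for any prior distribution $P$ on $\mathcal{H}$, and any $\delta\in(0,1]$, we
have

$$\Pr_{\substack{S\sim D^m \\ S_U\sim D_{\text{unlabeled}}^{m'}}}\left(\forall\, Q \text{ on } \mathcal{H}:\ R_D(B_Q) \le 1 - \frac{\left(1 - 2\cdot\sup \mathcal{R}^{\delta/2}_{Q,S}\right)^2}{1 - 2\cdot\inf \mathcal{D}^{\delta/2}_{Q,S_U}}\right) \ge 1-\delta.$$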


Proof In the presence of a large amount of unlabeled data (denoted by the set $S_U$), one
can use Theorem 25 to obtain an accurate lower bound of $d_Q^D$. An upper bound of $R_D(G_Q)$
can also be obtained via Corollary 21 but, this time, on the labeled data $S$. Thus, similarly
as in the proof of PAC-Bound 1, the result follows from Theorem 11. ■


5.4.3 Computation of PAC-Bounds 1 and 1’ 

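To compute PAC-Bound 1, we obtain the values of $r = \sup \mathcal{R}^{\delta/2}_{Q,S}$ and $d = \inf \mathcal{D}^{\delta/2}_{Q,S}$ by
solving

$$\mathrm{kl}\big(R_S(G_Q) \,\big\|\, r\big) = \frac{1}{m}\left[\mathrm{KL}(Q\|P) + \ln\frac{\xi(m)}{\delta/2}\right], \quad \text{with } R_S(G_Q) \le r \le \tfrac{1}{2},$$
$$\text{and}\quad \mathrm{kl}\big(d_Q^S \,\big\|\, d\big) = \frac{1}{m}\left[2\cdot\mathrm{KL}(Q\|P) + \ln\frac{\xi(m)}{\delta/2}\right], \quad \text{with } d \le d_Q^S.$$

These equations are very similar to the one we solved to compute PAC-Bound 0, as described
in Section 5.2.2. Once $r$ and $d$ are computed, the bound on $R_D(B_Q)$ is given by $1 - \frac{(1-2r)^2}{1-2d}$.

The same methodology can be used to compute PAC-Bound 1’, except that in the
semi-supervised setting, the disagreement is computed on the unlabeled data $S_U$.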
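As a rough illustration of this computation, the following sketch inverts the two kl equations by bisection and then plugs $r$ and $d$ into the C-bound expression. It is not the authors' released implementation: the function names (`xi`, `kl_bernoulli`, `kl_inverse`, `pac_bound_1`) and the example inputs are assumptions made for this example, and $\xi(m)$ is evaluated directly from its definition as a sum over $k$.

```python
import math


def xi(m):
    """xi(m) = sum_k C(m,k) (k/m)^k (1-k/m)^(m-k), evaluated in log-space."""
    total = 2.0  # the k = 0 and k = m terms are both equal to 1
    for k in range(1, m):
        log_term = (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
                    + k * math.log(k / m) + (m - k) * math.log(1 - k / m))
        total += math.exp(log_term)
    return total


def kl_bernoulli(q, p):
    """Binary kl(q||p), with the convention 0 ln 0 = 0."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    out = 0.0
    if q > 0:
        out += q * math.log(q / p)
    if q < 1:
        out += (1 - q) * math.log((1 - q) / (1 - p))
    return out


def kl_inverse(q, rhs, upper, tol=1e-10):
    """Largest p >= q (upper=True) or smallest p <= q (upper=False) with kl(q||p) <= rhs."""
    lo, hi = (q, 1.0) if upper else (0.0, q)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        inside = kl_bernoulli(q, mid) <= rhs
        if upper:
            lo, hi = (mid, hi) if inside else (lo, mid)
        else:
            lo, hi = (lo, mid) if inside else (mid, hi)
    return lo if upper else hi


def pac_bound_1(gibbs_risk_S, disagreement_S, kl_QP, m, delta):
    """Return (bound, r, d) following the two equations above, with delta/2 in each."""
    log_term = math.log(xi(m) / (delta / 2))
    r = min(0.5, kl_inverse(gibbs_risk_S, (kl_QP + log_term) / m, upper=True))
    d = kl_inverse(disagreement_S, (2 * kl_QP + log_term) / m, upper=False)
    if d >= 0.5:  # degenerate denominator: fall back to the trivial bound
        return 1.0, r, d
    return 1 - (1 - 2 * r) ** 2 / (1 - 2 * d), r, d


# Illustrative inputs only (empirical Gibbs risk 0.30, empirical disagreement 0.40).
bound, r, d = pac_bound_1(0.30, 0.40, kl_QP=5.0, m=1000, delta=0.05)
print(r, d, bound)
```

For PAC-Bound 1’, the same routine would apply with the disagreement and its sample size taken from the unlabeled set, which requires passing a separate sample size for the second kl inversion.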

5.5 PAC-Bayesian Theory to Directly Bound the C-bound 

PAC-Bounds 1 and 1’ of the last section require two approximations to upper bound $C_Q^D$:
one on $R_D(G_Q)$ and another on $d_Q^D$. We introduce below an extension to the PAC-Bayesian
theory (Theorem 28) that enables us to directly bound $C_Q^D$. To do so, we directly bound any
pair of expectations among $e_Q^D$, $s_Q^D$ and $d_Q^D$. For this reason, the new PAC-Bayesian theorem
is based on a trivalent random variable instead of a Bernoulli one (which is bivalent). Note
that Seeger (2003) and Seldin and Tishby (2010) have presented more general PAC-Bayesian
theorems valid for $k$-valent random variables, for any positive integer $k$. However, our result
leads to tighter bounds for the $k = 3$ case.

Before we get to this new PAC-Bayesian theorem (Theorem 28), we need some preliminary
results.

5.5.1 A General PAC-Bayesian Theorem for Two Losses of Paired-Voters 

Theorem 26 below allows us to simultaneously bound two losses of paired-voters. This 
result is inspired by the general PAC-Bayesian theorem for real-valued losses (Theorem 18). 


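Theorem 26 For any distribution $D$ on $\mathcal{X}\times\{-1,1\}$, for any set $\mathcal{H}$ of voters $\mathcal{X}\to[-1,1]$,
for any two losses $\mathcal{L}_\alpha, \mathcal{L}_\beta : [-1,1]\times\{-1,1\}\to[0,1]$ with $\alpha,\beta\in\{e,s,d\}$, for any prior
distribution $P$ on $\mathcal{H}$, for any $\delta\in(0,1]$, for any $m' > 0$, and for any convex function
$\mathcal{D}(q_1,q_2\|p_1,p_2)$, we have

$$\Pr_{S\sim D^m}\left(\begin{array}{l}
\text{For all posteriors } Q \text{ on } \mathcal{H}: \\[2pt]
\mathcal{D}\Big(\mathbb{E}_{f_{ij}\sim Q^2}\mathbb{E}^{\mathcal{L}_\alpha}_{S}(f_{ij}),\ \mathbb{E}_{f_{ij}\sim Q^2}\mathbb{E}^{\mathcal{L}_\beta}_{S}(f_{ij}) \,\Big\|\, \mathbb{E}_{f_{ij}\sim Q^2}\mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij}),\ \mathbb{E}_{f_{ij}\sim Q^2}\mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})\Big) \\[2pt]
\qquad \le \dfrac{1}{m'}\left[2\cdot\mathrm{KL}(Q\|P) + \ln\dfrac{\Omega}{\delta}\right]
\end{array}\right) \ge 1-\delta,$$

where $\Omega \overset{\text{def}}{=} \mathbb{E}_{S\sim D^m}\, \mathbb{E}_{f_{ij}\sim P^2}\, e^{\,m'\cdot\mathcal{D}\left(\mathbb{E}^{\mathcal{L}_\alpha}_{S}(f_{ij}),\ \mathbb{E}^{\mathcal{L}_\beta}_{S}(f_{ij}) \,\|\, \mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij}),\ \mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})\right)}$.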


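Proof To simplify the notation, first let $\alpha_{ij}^{D'} \overset{\text{def}}{=} \mathbb{E}^{\mathcal{L}_\alpha}_{D'}(f_{ij})$ and $\beta_{ij}^{D'} \overset{\text{def}}{=} \mathbb{E}^{\mathcal{L}_\beta}_{D'}(f_{ij})$.

Now, since $\mathbb{E}_{f_{ij}\sim P^2}\, e^{\,m'\cdot\mathcal{D}(\alpha_{ij}^S,\beta_{ij}^S \,\|\, \alpha_{ij}^D,\beta_{ij}^D)}$ is a positive random variable, Markov’s inequality
(Lemma 46, in Appendix A) can be applied to give

$$\Pr_{S\sim D^m}\left(\mathbb{E}_{f_{ij}\sim P^2}\, e^{\,m'\cdot\mathcal{D}(\alpha_{ij}^S,\beta_{ij}^S \,\|\, \alpha_{ij}^D,\beta_{ij}^D)} \le \frac{1}{\delta}\, \mathbb{E}_{S\sim D^m}\, \mathbb{E}_{f_{ij}\sim P^2}\, e^{\,m'\cdot\mathcal{D}(\alpha_{ij}^S,\beta_{ij}^S \,\|\, \alpha_{ij}^D,\beta_{ij}^D)}\right) \ge 1-\delta.$$

By exploiting the fact that $\ln(\cdot)$ is an increasing function, and by the definition of $\Omega$, we
obtain

$$\Pr_{S\sim D^m}\left(\ln\left[\mathbb{E}_{f_{ij}\sim P^2}\, e^{\,m'\cdot\mathcal{D}(\alpha_{ij}^S,\beta_{ij}^S \,\|\, \alpha_{ij}^D,\beta_{ij}^D)}\right] \le \ln\frac{\Omega}{\delta}\right) \ge 1-\delta. \qquad (30)$$

We apply the change of measure inequality (Lemma 17) on the left side of the innermost
inequality, with $\phi(f_{ij}) = m'\cdot\mathcal{D}(\alpha_{ij}^S,\beta_{ij}^S \,\|\, \alpha_{ij}^D,\beta_{ij}^D)$, $P = P^2$ and $Q = Q^2$. We then use Jensen’s
inequality (Lemma 47, in Appendix A), exploiting the convexity of $\mathcal{D}$:

$$\begin{aligned}
\ln\left[\mathbb{E}_{f_{ij}\sim P^2}\, e^{\,m'\cdot\mathcal{D}(\alpha_{ij}^S,\beta_{ij}^S \,\|\, \alpha_{ij}^D,\beta_{ij}^D)}\right]
&\ge m'\cdot\mathbb{E}_{f_{ij}\sim Q^2}\, \mathcal{D}\big(\alpha_{ij}^S,\beta_{ij}^S \,\big\|\, \alpha_{ij}^D,\beta_{ij}^D\big) - \mathrm{KL}(Q^2\|P^2) \\
&\ge m'\cdot\mathcal{D}\Big(\mathbb{E}_{f_{ij}\sim Q^2}\alpha_{ij}^S,\ \mathbb{E}_{f_{ij}\sim Q^2}\beta_{ij}^S \,\Big\|\, \mathbb{E}_{f_{ij}\sim Q^2}\alpha_{ij}^D,\ \mathbb{E}_{f_{ij}\sim Q^2}\beta_{ij}^D\Big) - \mathrm{KL}(Q^2\|P^2) \\
&= m'\cdot\mathcal{D}\Big(\mathbb{E}_{f_{ij}\sim Q^2}\alpha_{ij}^S,\ \mathbb{E}_{f_{ij}\sim Q^2}\beta_{ij}^S \,\Big\|\, \mathbb{E}_{f_{ij}\sim Q^2}\alpha_{ij}^D,\ \mathbb{E}_{f_{ij}\sim Q^2}\beta_{ij}^D\Big) - 2\cdot\mathrm{KL}(Q\|P).
\end{aligned}$$

The last equality $\mathrm{KL}(Q^2\|P^2) = 2\cdot\mathrm{KL}(Q\|P)$ has been shown in the proof of Theorem 25.
The result can then be straightforwardly obtained by inserting the last inequality into
Equation (30). ■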


5.5.2 A PAC-Bayesian Theorem for Any Pair Among $e_Q^D$, $s_Q^D$, and $d_Q^D$

In Section 5.1, Theorem 20 was obtained from Theorem 18. Similarly, the main theorem of
this subsection (Theorem 28) is deduced from Theorem 26. However, a notable difference
between Theorems 20 and 28 is that the former uses the KL-divergence $\mathrm{kl}(\cdot\|\cdot)$ between
distributions of two Bernoulli (i.e., bivalent) random variables, and the latter uses the
KL-divergence $\mathrm{kl}(\cdot,\cdot\|\cdot,\cdot)$ between distributions of two trivalent random variables.

Given two trivalent random variables $Y_q$ and $Y_p$ with $P(Y_q = a) = q_1$, $P(Y_q = b) = q_2$,
$P(Y_q = c) = 1-q_1-q_2$, and $P(Y_p = a) = p_1$, $P(Y_p = b) = p_2$, $P(Y_p = c) = 1-p_1-p_2$, we
denote by $\mathrm{kl}(q_1,q_2\|p_1,p_2)$ the Kullback-Leibler divergence between $Y_q$ and $Y_p$. Thus, we have

$$\mathrm{kl}(q_1,q_2\|p_1,p_2) = q_1\ln\frac{q_1}{p_1} + q_2\ln\frac{q_2}{p_2} + (1-q_1-q_2)\ln\frac{1-q_1-q_2}{1-p_1-p_2}. \qquad (31)$$

Note that $\mathrm{kl}(q_1,q_2\|p_1,p_2)$ is a shorthand notation for $\mathrm{KL}(Q\|P)$ of Equation (20), with
$Q = (q_1, q_2, 1-q_1-q_2)$ and $P = (p_1, p_2, 1-p_1-p_2)$. Corollary 50 (in Appendix A) shows that
$\mathrm{kl}(q_1,q_2\|p_1,p_2)$ is a convex function.

To be able to apply Theorem 26 with $\mathcal{D}(q_1,q_2\|p_1,p_2) = \mathrm{kl}(q_1,q_2\|p_1,p_2)$, we need
Lemma 27 (below). This lemma is inspired by Lemma 19. However, in contrast with
the latter, which is based on Maurer’s lemma, Lemma 27 needs a generalization of it to
trivalent random variables (instead of bivalent ones). The proof of this generalization is
provided in Appendix A, listed as Lemma 52.


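Lemma 27 For any distribution $D$ on $\mathcal{X}\times\{-1,1\}$, for any paired-voters $f_{ij}$, and any
positive integer $m$, we have

$$\mathbb{E}_{S\sim D^m}\, e^{\,m\cdot\mathrm{kl}\left(\mathbb{E}^{\mathcal{L}_\alpha}_{S}(f_{ij}),\ \mathbb{E}^{\mathcal{L}_\beta}_{S}(f_{ij}) \,\|\, \mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij}),\ \mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})\right)} \le \xi(m) + m,$$

where $\mathcal{L}_\alpha$ and $\mathcal{L}_\beta$ can be any two of the three losses $\mathcal{L}_s$, $\mathcal{L}_e$ or $\mathcal{L}_d$, and where $\xi(m)$ is defined
at Equation (22). Therefore, $m + \sqrt{m} \le \xi(m) + m \le m + 2\sqrt{m}$.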
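Proof Let $Y_{ij}$ be a random variable that follows a multinomial distribution with three
possible outcomes: $a = (1,0)$, $b = (0,1)$ and $c = (0,0)$. The “Trinomial” distribution
is chosen such that $\Pr(Y_{ij} = a) = \mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij})$, $\Pr(Y_{ij} = b) = \mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})$ and $\Pr(Y_{ij} = c) =
1 - \mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij}) - \mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})$. Given $m$ trials of $Y_{ij}$, we denote $Y_{ij}^a$, $Y_{ij}^b$ and $Y_{ij}^c$ the number
of times each outcome is observed. Note that $Y_{ij}$ is totally defined by $(Y_{ij}^a, Y_{ij}^b)$, since
$Y_{ij}^c = m - Y_{ij}^a - Y_{ij}^b$. We thus use the notation

$$Y_{ij} = (Y_{ij}^a, Y_{ij}^b) \sim T_{ij} \overset{\text{def}}{=} \mathrm{Trinomial}\big(m,\ \mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij}),\ \mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})\big).$$

Hence, we have

$$\Pr_{(Y_{ij}^a, Y_{ij}^b)\sim T_{ij}}\big(Y_{ij}^a = k_1 \wedge Y_{ij}^b = k_2\big) = \binom{m}{k_1}\binom{m-k_1}{k_2}\big[\mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij})\big]^{k_1}\big[\mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})\big]^{k_2}\big[1 - \mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij}) - \mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})\big]^{m-k_1-k_2},$$

for any $k_1 \in \{0,\ldots,m\}$ and any $k_2 \in \{0,\ldots,m-k_1\}$.

Now, applying Lemma 52 to the convex function $e^{\,m\cdot\mathrm{kl}\left(\cdot,\cdot \,\|\, \mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij}),\ \mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})\right)}$, and by the
definition of $\mathrm{kl}(\cdot,\cdot\|\cdot,\cdot)$, we have

$$\begin{aligned}
\mathbb{E}_{S\sim D^m}\, e^{\,m\cdot\mathrm{kl}\left(\mathbb{E}^{\mathcal{L}_\alpha}_{S}(f_{ij}),\ \mathbb{E}^{\mathcal{L}_\beta}_{S}(f_{ij}) \,\|\, \mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij}),\ \mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})\right)}
&\le \mathbb{E}_{(Y_{ij}^a, Y_{ij}^b)\sim T_{ij}}\, e^{\,m\cdot\mathrm{kl}\left(\frac{1}{m}Y_{ij}^a,\ \frac{1}{m}Y_{ij}^b \,\|\, \mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij}),\ \mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})\right)} \\
&= \mathbb{E}_{(Y_{ij}^a, Y_{ij}^b)\sim T_{ij}} \left(\frac{\frac{1}{m}Y_{ij}^a}{\mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij})}\right)^{Y_{ij}^a} \left(\frac{\frac{1}{m}Y_{ij}^b}{\mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})}\right)^{Y_{ij}^b} \left(\frac{1 - \frac{1}{m}Y_{ij}^a - \frac{1}{m}Y_{ij}^b}{1 - \mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij}) - \mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})}\right)^{m - Y_{ij}^a - Y_{ij}^b}.
\end{aligned}$$

As $Y_{ij}$ follows a trinomial law, we then have

$$\begin{aligned}
&\mathbb{E}_{(Y_{ij}^a, Y_{ij}^b)\sim T_{ij}} \left(\frac{\frac{1}{m}Y_{ij}^a}{\mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij})}\right)^{Y_{ij}^a} \left(\frac{\frac{1}{m}Y_{ij}^b}{\mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})}\right)^{Y_{ij}^b} \left(\frac{1 - \frac{1}{m}Y_{ij}^a - \frac{1}{m}Y_{ij}^b}{1 - \mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij}) - \mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})}\right)^{m - Y_{ij}^a - Y_{ij}^b} \\
&\quad= \sum_{k_1=0}^{m}\sum_{k_2=0}^{m-k_1} \Pr_{(Y_{ij}^a, Y_{ij}^b)\sim T_{ij}}\big(Y_{ij}^a = k_1 \wedge Y_{ij}^b = k_2\big) \left(\frac{\frac{k_1}{m}}{\mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij})}\right)^{k_1} \left(\frac{\frac{k_2}{m}}{\mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})}\right)^{k_2} \left(\frac{1 - \frac{k_1}{m} - \frac{k_2}{m}}{1 - \mathbb{E}^{\mathcal{L}_\alpha}_{D}(f_{ij}) - \mathbb{E}^{\mathcal{L}_\beta}_{D}(f_{ij})}\right)^{m-k_1-k_2} \\
&\quad= \sum_{k_1=0}^{m}\sum_{k_2=0}^{m-k_1} \binom{m}{k_1}\binom{m-k_1}{k_2} \left(\frac{k_1}{m}\right)^{k_1} \left(\frac{k_2}{m}\right)^{k_2} \left(1 - \frac{k_1}{m} - \frac{k_2}{m}\right)^{m-k_1-k_2} \\
&\quad= \xi(m) + m.
\end{aligned}$$

The last equality has been proven by Younsi (2012). Recall that $\xi(m)$ is defined by
Equation (22). ■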
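The closing identity of this proof, and the bounds on $\xi(m)$ quoted in Lemma 27, are easy to check numerically for small $m$. The sketch below is only a sanity check under the definition of $\xi(m)$ as $\sum_{k=0}^{m}\binom{m}{k}(k/m)^k(1-k/m)^{m-k}$; the function names are illustrative and not taken from the paper's code.

```python
import math


def xi(m):
    """xi(m) = sum_{k=0}^{m} C(m,k) (k/m)^k (1 - k/m)^(m-k), with 0^0 = 1."""
    return sum(math.comb(m, k) * (k / m) ** k * ((m - k) / m) ** (m - k)
               for k in range(m + 1))


def trinomial_sum(m):
    """Double sum over (k1, k2) appearing at the end of the proof of Lemma 27."""
    total = 0.0
    for k1 in range(m + 1):
        for k2 in range(m - k1 + 1):
            k3 = m - k1 - k2  # count of the third outcome
            total += (math.comb(m, k1) * math.comb(m - k1, k2)
                      * (k1 / m) ** k1 * (k2 / m) ** k2 * (k3 / m) ** k3)
    return total


for m in range(1, 13):
    print(m, trinomial_sum(m), xi(m) + m)  # the last two columns should coincide
```

For instance, with $m = 1$ both sides equal 3, since each of the three admissible pairs $(k_1, k_2)$ contributes exactly 1.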
We are now ready to present the main result of this section. By bounding any pair of
expectations among $e_Q^D$, $s_Q^D$ and $d_Q^D$, Theorem 28 is the perfect tool to directly bound the
C-bound.


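Theorem 28 For any distribution $D$ on $\mathcal{X}\times\{-1,1\}$, for any set $\mathcal{H}$ of voters $\mathcal{X}\to[-1,1]$,
for any prior distribution $P$ on $\mathcal{H}$, and any $\delta\in(0,1]$, we have

$$\Pr_{S\sim D^m}\left(\text{For all posteriors } Q \text{ on } \mathcal{H}:\ \mathrm{kl}\big(\alpha_Q^S, \beta_Q^S \,\big\|\, \alpha_Q^D, \beta_Q^D\big) \le \frac{1}{m}\left[2\cdot\mathrm{KL}(Q\|P) + \ln\frac{\xi(m)+m}{\delta}\right]\right) \ge 1-\delta,$$

where $\alpha_Q^{D'}$ and $\beta_Q^{D'}$ can be any two distinct choices among $d_Q^{D'}$, $e_Q^{D'}$ and $s_Q^{D'}$.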

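Proof The result follows from Theorem 26 with $\mathcal{D}(q_1,q_2\|p_1,p_2) = \mathrm{kl}(q_1,q_2\|p_1,p_2)$ and
$m' = m$. Since Equation (28) shows that $\alpha_Q^{D'} = \mathbb{E}_{f_{ij}\sim Q^2}\,\alpha_{ij}^{D'}$ and $\beta_Q^{D'} = \mathbb{E}_{f_{ij}\sim Q^2}\,\beta_{ij}^{D'}$, we have

$$\Pr_{S\sim D^m}\left(\forall\, Q \text{ on } \mathcal{H}:\ \mathrm{kl}\big(\alpha_Q^S, \beta_Q^S \,\big\|\, \alpha_Q^D, \beta_Q^D\big) \le \frac{1}{m}\left[2\cdot\mathrm{KL}(Q\|P) + \ln\left(\frac{1}{\delta}\, \mathbb{E}_{S\sim D^m}\, \mathbb{E}_{f_{ij}\sim P^2}\, e^{\,m\cdot\mathrm{kl}(\alpha_{ij}^S,\beta_{ij}^S \,\|\, \alpha_{ij}^D,\beta_{ij}^D)}\right)\right]\right) \ge 1-\delta.$$

As the prior distribution $P^2$ is independent of $S$, we can swap the two expectations in the
expression $\mathbb{E}_{S\sim D^m}\, \mathbb{E}_{f_{ij}\sim P^2}\, e^{\,m\cdot\mathrm{kl}(\alpha_{ij}^S,\beta_{ij}^S \,\|\, \alpha_{ij}^D,\beta_{ij}^D)}$. This observation, together with Lemma 27, gives

$$\begin{aligned}
\mathbb{E}_{S\sim D^m}\, \mathbb{E}_{f_{ij}\sim P^2}\, e^{\,m\cdot\mathrm{kl}(\alpha_{ij}^S,\beta_{ij}^S \,\|\, \alpha_{ij}^D,\beta_{ij}^D)}
&= \mathbb{E}_{f_{ij}\sim P^2}\, \mathbb{E}_{S\sim D^m}\, e^{\,m\cdot\mathrm{kl}(\alpha_{ij}^S,\beta_{ij}^S \,\|\, \alpha_{ij}^D,\beta_{ij}^D)} \\
&\le \mathbb{E}_{f_{ij}\sim P^2}\, \big[\xi(m) + m\big] \\
&= \xi(m) + m. \qquad ■
\end{aligned}$$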


A first version of Theorem 28 was proposed by Lacasse et al. (2006), with the difference
that $\ln\frac{(m+1)(m+2)}{2\delta}$ in the latter is now replaced by $\ln\frac{\xi(m)+m}{\delta}$ in the former. Since
$\xi(m) + m \le \frac{(m+1)(m+2)}{2}$, the new theorem is therefore tighter.


5.5.3 Another Bound for the Risk of the Majority Vote 


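First, we need the following notation that is related to Theorem 28. Given any prior
distribution $P$ on $\mathcal{H}$,

$$\mathcal{A}^{\delta}_{Q,S} \overset{\text{def}}{=} \left\{(d,e) \,:\, \mathrm{kl}\big(d_Q^S, e_Q^S \,\big\|\, d, e\big) \le \frac{1}{m}\left[2\cdot\mathrm{KL}(Q\|P) + \ln\frac{\xi(m)+m}{\delta}\right]\right\}. \qquad (32)$$

The bound is obtained by seeking the point of $\mathcal{A}^{\delta}_{Q,S}$ maximizing the C-bound. Since a
point $(d, e)$ of $\mathcal{A}^{\delta}_{Q,S}$ expresses a disagreement $d$ and a joint error $e$, we directly compute the
bound on $C_Q^D$ using Equation (25).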
Note however that $\mathcal{A}^{\delta}_{Q,S}$ can contain points that are not possible in practice, i.e., points
that are not achievable with any data-generating distribution $D$. Indeed, by Proposition 9,
we know that

$$d_Q^D \le 2\cdot R_D(G_Q)\cdot\big(1 - R_D(G_Q)\big).$$

Based on this property, it is possible to significantly reduce the achievable region of $\mathcal{A}^{\delta}_{Q,S}$.
To do so, we must first rewrite this property based on $d_Q^D$ and $e_Q^D$ only.

$$\begin{aligned}
& d_Q^D \le 2\cdot R_D(G_Q)\cdot\big(1 - R_D(G_Q)\big) = 2\cdot\big(e_Q^D + \tfrac{1}{2}d_Q^D\big)\cdot\big(1 - (e_Q^D + \tfrac{1}{2}d_Q^D)\big) \\
\Leftrightarrow\quad & 0 \le -\tfrac{1}{2}\big(d_Q^D\big)^2 - 2\,e_Q^D\, d_Q^D + 2\,e_Q^D - 2\big(e_Q^D\big)^2 \\
\Leftrightarrow\quad & d_Q^D \le 2\cdot\Big(\sqrt{e_Q^D} - e_Q^D\Big). \qquad (33)
\end{aligned}$$
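The last equivalence leading to Equation (33) can be made explicit by completing the square. The following short derivation is a worked restatement of that step, writing $d$ and $e$ for $d_Q^D$ and $e_Q^D$ to lighten the notation:

```latex
\begin{aligned}
0 \le -\tfrac{1}{2}d^2 - 2ed + 2e - 2e^2
  &\iff d^2 + 4ed + 4e^2 \le 4e        && \text{(multiply by $-2$ and rearrange)} \\
  &\iff (d + 2e)^2 \le 4e              && \text{(perfect square)} \\
  &\iff d + 2e \le 2\sqrt{e}           && \text{(since $d + 2e \ge 0$)} \\
  &\iff d \le 2\big(\sqrt{e} - e\big).
\end{aligned}
```

Since $\sqrt{e} - e \le \frac{1}{4}$ for every $e \in [0,1]$, this constraint also recovers the fact that the disagreement never exceeds $\frac{1}{2}$.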

Note also that if $R_D(G_Q) \ge \frac{1}{2}$, there is no bound on $R_D(B_Q)$ better than the trivial one
$R_D(B_Q) \le 1$. We therefore consider only the pairs $(d, e) \in \mathcal{A}^{\delta}_{Q,S}$ that do not correspond to
that situation. Since $R_D(G_Q) = \frac{1}{2}(2e_Q^D + d_Q^D)$ (Equation 24), this is therefore equivalent to
considering only the pairs $(d,e)$ such that $2e + d < 1$. We later show that this still gives a
valid bound. Thus, from all these ideas, we restrain $\mathcal{A}^{\delta}_{Q,S}$ (Equation 32) as follows:

$$\widetilde{\mathcal{A}}^{\delta}_{Q,S} \overset{\text{def}}{=} \left\{(d, e) \in \mathcal{A}^{\delta}_{Q,S} \,:\, d \le 2\big(\sqrt{e} - e\big) \text{ and } 2e + d < 1\right\}, \qquad (34)$$

and obtain the following bound that, in contrast with PAC-Bound 1, directly bounds $C_Q^D$.

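PAC-Bound 2 For any distribution $D$ on $\mathcal{X}\times\{-1,1\}$, for any set $\mathcal{H}$ of voters $\mathcal{X}\to[-1,1]$,
for any prior distribution $P$ on $\mathcal{H}$, and any $\delta\in(0,1]$, we have

$$\Pr_{S\sim D^m}\left(\forall\, Q \text{ on } \mathcal{H}:\ R_D(B_Q) \le \sup_{(d,e)\in\widetilde{\mathcal{A}}^{\delta}_{Q,S}} \left[1 - \frac{\big(1 - (2e + d)\big)^2}{1 - 2d}\right]\right) \ge 1-\delta.$$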

Proof We need to show that the supremum value in the statement of PAC-Bound 2 is a
valid upper bound of $R_D(B_Q)$. Note that if $\widetilde{\mathcal{A}}^{\delta}_{Q,S} = \emptyset$, then the supremum is $+\infty$, and the
bound is trivially valid. Therefore, we assume below that $\widetilde{\mathcal{A}}^{\delta}_{Q,S}$ is not empty.

Let us consider $(d, e) \in \widetilde{\mathcal{A}}^{\delta}_{Q,S}$. From the conditions $d \le 2(\sqrt{e} - e)$ and $2e + d < 1$, it
follows by straightforward calculations that $d < \frac{1}{2}$. This implies that

$$1 - \frac{\big(1 - (2e + d)\big)^2}{1 - 2d} < 1,$$

because both the numerator and the denominator of the fraction are strictly positive
(remember that $2e + d < 1$). Thus, the supremum is at most 1.

Let us consider the three following cases.

Case 1: The supremum is not attained in $\widetilde{\mathcal{A}}^{\delta}_{Q,S}$. Note that as $\widetilde{\mathcal{A}}^{\delta}_{Q,S}$ is a subset of $\mathbb{R}^2$, the
supremum must be attained for a pair in the closure of $\widetilde{\mathcal{A}}^{\delta}_{Q,S}$. The set $\widetilde{\mathcal{A}}^{\delta}_{Q,S}$ fails to be closed
only because of its $2e + d < 1$ constraint. Therefore, the supremum is achieved for a pair
$(d, e)$ in the closure for which $1 - (2e + d) = 0$, implying that the value of the supremum is
in that case 1, which trivially is a valid bound for $R_D(B_Q)$.

Case 2: The supremum is attained in $\widetilde{\mathcal{A}}^{\delta}_{Q,S}$ and has value 1. In that case, the bound is
again trivially valid.

Case 3: The supremum is attained in $\widetilde{\mathcal{A}}^{\delta}_{Q,S}$ and has a value strictly lower than 1. In
that case, there must be an $\epsilon > 0$ such that $2e + d \le 1 - \epsilon$ for all $(d, e) \in \widetilde{\mathcal{A}}^{\delta}_{Q,S}$. Hence,
because of Equation (33) and Theorem 28, we have that $2e_Q^D + d_Q^D \le 1 - \epsilon$ with probability
$1-\delta$. Since $R_D(G_Q) = \frac{1}{2}(2e_Q^D + d_Q^D)$ (Equation 24), this implies that, with probability $1-\delta$,
$R_D(G_Q) \le \frac{1}{2} - \frac{1}{2}\epsilon$. Hence, with probability $1-\delta$, Theorem 11 is valid - i.e., $C_Q^D$ bounds
$R_D(B_Q)$ - and $(d_Q^D, e_Q^D) \in \widetilde{\mathcal{A}}^{\delta}_{Q,S}$. Thus,

$$R_D(B_Q) \le C_Q^D = 1 - \frac{\big(1 - (2e_Q^D + d_Q^D)\big)^2}{1 - 2d_Q^D} \le \sup_{(d,e)\in\widetilde{\mathcal{A}}^{\delta}_{Q,S}} \left[1 - \frac{\big(1 - (2e + d)\big)^2}{1 - 2d}\right],$$

and we are done. ■


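In some situations, we can slightly improve PAC-Bound 2 by bounding the joint error $e_Q^D$
via Theorem 25 with $\delta$ replaced by $\delta/2$. This removes all pairs $(d, e)$ such that $e$ does not
belong to the set $\mathcal{E}^{\delta/2}_{Q,S}$ defined as

$$\mathcal{E}^{\delta/2}_{Q,S} \overset{\text{def}}{=} \left\{e \,:\, \mathrm{kl}\big(e_Q^S \,\big\|\, e\big) \le \frac{1}{m}\left[2\cdot\mathrm{KL}(Q\|P) + \ln\frac{\xi(m)}{\delta/2}\right]\right\}.$$

Then, by applying PAC-Bound 2, with $\delta$ replaced by $\delta/2$, one can obtain the following
slightly improved bound.

PAC-Bound 2’ For any distribution $D$ on $\mathcal{X}\times\{-1,1\}$, for any set $\mathcal{H}$ of voters $\mathcal{X}\to[-1,1]$,
for any prior distribution $P$ on $\mathcal{H}$, and any $\delta\in(0,1]$, we have

$$\Pr_{S\sim D^m}\left(\forall\, Q \text{ on } \mathcal{H}:\ R_D(B_Q) \le \sup_{(d,e)\in\widehat{\mathcal{A}}^{\delta/2}_{Q,S}} \left[1 - \frac{\big(1 - (2e + d)\big)^2}{1 - 2d}\right]\right) \ge 1-\delta,$$

where

$$\widehat{\mathcal{A}}^{\delta/2}_{Q,S} \overset{\text{def}}{=} \left\{(d,e) \in \mathcal{A}^{\delta/2}_{Q,S} \,:\, d \le 2\big(\sqrt{e} - e\big),\ 2e + d < 1 \text{ and } e \le \sup \mathcal{E}^{\delta/2}_{Q,S}\right\}. \qquad (35)$$

Proof Immediate consequence of Theorem 25, PAC-Bound 2, and the union bound. ■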
(a) Contour plot of $\mathrm{kl}(0.4, 0.1 \,\|\, d, e)$.

(b) Contour plot of $F_C(d, e)$.

Figure 5: Example of application of PAC-Bound 2. We suppose that $\mathrm{KL}(Q\|P) = 5$, $m =
1000$ and $\delta = 0.05$. If we observe an empirical joint error $e_Q^S = 0.10$ and an
empirical disagreement $d_Q^S = 0.40$ (thus, a Gibbs risk $R_S(G_Q) = 0.1 + \frac{1}{2}\cdot 0.4 =
0.30$), then we need to maximize the function $F_C(d, e)$ over the domain $\widetilde{\mathcal{A}}^{\delta}_{Q,S}$ given
by three constraints: $\mathrm{kl}(0.4, 0.1 \,\|\, d, e) \le \frac{1}{m}\big[2\,\mathrm{KL}(Q\|P) + \ln\frac{\xi(m)+m}{\delta}\big] \approx 0.0199$ (blue
oval), $d \le 2(\sqrt{e} - e)$ (black curve) and $2e + d < 1$ (black dashed line). Therefore, we
obtain a bound $R_D(B_Q) \le 0.679$ (corresponding to the green diamond marker).


5.5.4 Computation of PAC-Bounds 2 and 2’ 

Let us consider the C-bound as a function $F_C$ of two variables $(d, e) \in [0, \frac{1}{2}] \times [0,1]$, instead
of a function of the distribution $Q$.

$$F_C(d,e) \overset{\text{def}}{=} 1 - \frac{\big[1 - (2e + d)\big]^2}{1 - 2d}. \qquad (36)$$

Proposition 54 (provided in Appendix A) shows that $F_C$ is a concave function. Therefore,
PAC-Bound 2 is obtained by maximizing $F_C(d,e)$ in the domain $\widetilde{\mathcal{A}}^{\delta}_{Q,S}$ (Equation 34),
which is both bounded and convex. Several optimization methods can achieve this. In our
experiments, we decompose $F_C(d,e)$ in two nested functions of a single argument:

$$\sup_{(d,e)\in\widetilde{\mathcal{A}}^{\delta}_{Q,S}} F_C(d, e) = \sup_{d\,:\,(d,\cdot)\in\widetilde{\mathcal{A}}^{\delta}_{Q,S}} F_C^{*}(d), \quad \text{where} \quad F_C^{*}(d) \overset{\text{def}}{=} \sup_{e\,:\,(d,e)\in\widetilde{\mathcal{A}}^{\delta}_{Q,S}} F_C(d, e).$$

Thus, we implement the maximization of $F_C$ using a one-dimensional optimization algorithm
twice. Figure 5 shows an application example of PAC-Bound 2.
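To make the maximization concrete, here is a minimal sketch that replaces the nested one-dimensional optimizers by a plain grid search over $(d, e)$, which is slower but easier to verify. This is a substitute technique chosen for readability, not the procedure used in the experiments; the function names and the grid step are assumptions of this example. With the quantities of Figure 5 it should land close to the value reported there, up to the grid resolution.

```python
import math


def xi(m):
    """xi(m) = sum_k C(m,k) (k/m)^k (1-k/m)^(m-k), evaluated in log-space."""
    total = 2.0  # the k = 0 and k = m terms are both equal to 1
    for k in range(1, m):
        log_term = (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
                    + k * math.log(k / m) + (m - k) * math.log(1 - k / m))
        total += math.exp(log_term)
    return total


def kl_trinomial(q1, q2, p1, p2):
    """kl(q1,q2 || p1,p2) of Equation (31); infinite when p misses the support of q."""
    total = 0.0
    for q, p in ((q1, p1), (q2, p2), (1 - q1 - q2, 1 - p1 - p2)):
        if q > 0:
            if p <= 0:
                return math.inf
            total += q * math.log(q / p)
    return total


def pac_bound_2(d_S, e_S, kl_QP, m, delta, step=0.001):
    """Grid-search approximation of the supremum in PAC-Bound 2; returns (bound, rhs)."""
    rhs = (2 * kl_QP + math.log((xi(m) + m) / delta)) / m
    best = -math.inf
    n = int(round(0.5 / step))
    for i in range(n):                      # d ranges over [0, 1/2)
        d = i * step
        for j in range(n + 1):              # e ranges over [0, 1/2]
            e = j * step
            if 2 * e + d >= 1:              # constraint 2e + d < 1 (Gibbs risk < 1/2)
                break
            if d > 2 * (math.sqrt(e) - e):  # constraint of Equation (33)
                continue
            if kl_trinomial(d_S, e_S, d, e) > rhs:  # outside the kl region (32)
                continue
            best = max(best, 1 - (1 - (2 * e + d)) ** 2 / (1 - 2 * d))
    return best, rhs


# Quantities of Figure 5: d_S = 0.40, e_S = 0.10, KL(Q||P) = 5, m = 1000, delta = 0.05.
bound, rhs = pac_bound_2(0.40, 0.10, kl_QP=5.0, m=1000, delta=0.05)
print(rhs, bound)
```

Because the grid only visits feasible points, the returned value slightly underestimates the true supremum; a finer `step` or the nested one-dimensional search described above would tighten it.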

The computation of PAC-Bound 2’ is done using the same method, but we optimize
over the domain $\widehat{\mathcal{A}}^{\delta/2}_{Q,S}$ (Equation 35) instead of $\widetilde{\mathcal{A}}^{\delta}_{Q,S}$, which is also bounded and convex.
Of course, this requires computing $\sup \mathcal{E}^{\delta/2}_{Q,S}$ beforehand, using the same technique as for
PAC-Bounds 0, 1 and 1’. Figure 6 shows an application example of PAC-Bound 2’.
Figure 6: Example of application of PAC-Bound 2’. We use the same quantities as for
Figure 5. The red vertical line corresponds to the upper bound on the joint error,
resulting in an improved bound of $R_D(B_Q) \le 0.660$ (corresponding to the red star
marker). Note however that, even if the bound here is tighter, the egg-region is a
bit bigger than in the case of PAC-Bound 2 because every $\delta$ has been replaced
by $\delta/2$.


5.6 Empirical Comparison Between PAC-Bounds on the Bayes Risk $R_D(B_Q)$

We now propose an empirical comparison of all PAC-Bounds we presented so far. The 
numerical results of Figure 7 are obtained by using AdaBoost (Schapire and Singer, 1999) 
with decision stumps on the Mushroom UCI data set (which contains 8124 examples). This 
data set is randomly split into two halves: one training set S and one testing set T. For 
each round of boosting, we compute the usual PAC-Bayesian bound of twice the Gibbs risk 
(PAC-Bound 0) of the corresponding majority vote classifier, as well as the other variants 
of the PAC-Bayesian bounds presented in this paper. 

We can see that PAC-Bound 1 is generally tighter than PAC-Bound 0, and we obtain
a substantial improvement with PAC-Bound 2. Almost no improvement is obtained with
PAC-Bound 2’ in that case. We can also see that using unlabeled data to estimate $d_Q^D$ helps,
as PAC-Bound 1’ is the tightest. 10

However, we see in Figure 7 that after 8 rounds of boosting, all the bounds are degrading
even if the value of $C_Q^S$ continues to decrease. This drawback is due to the fact that the
denominator of $C_Q^D$ tends to 0, that is, the second moment of the margin $\mu_2(M_Q^D)$ is close
to 0 (see the first or the second forms of Theorem 11). Hence, in this context, the first
moment of the margin $\mu_1(M_Q^D)$ must be small as well. Thus, any slack in the bound of
$\mu_1(M_Q^D)$ has a multiplicative effect on each of the three proposed PAC-bounds of $R_D(B_Q)$.
Unfortunately, Boosting algorithms tend to construct majority votes with $\mu_1(M_Q^D)$ just
slightly larger than 0.


10. To obtain PAC-Bound 1’, we simulate the case where we have access to a large number of unlabeled
data by simply using the empirical value of the disagreement computed on the testing set.

Figure 7: Comparison of bounds of $R_D(B_Q)$ during 60 rounds of Boosting.


6. PAC-Bayesian Bounds without KL 

Having PAC-Bayesian theorems that bound the difference between $C_Q^D$ and $C_Q^S$ opens the
way to structural C-bound minimization algorithms. As for most PAC-Bayesian results, the
bound on $C_Q^D$ depends on an empirical estimate of it, and on the Kullback-Leibler divergence
$\mathrm{KL}(Q\|P)$ between the output distribution $Q$ and the a priori defined distribution $P$. In this
section, we present a theoretical extension of our PAC-Bayesian approach that is mandatory
to develop the $C_Q$-minimization algorithm of Section 8.

The next theorems introduce PAC-Bayesian bounds that have the surprising property of
having no KL term. This new approach is driven by the fact that our attempts to construct
algorithms that minimize any of the PAC-Bounds presented in the previous section ended
up being unsuccessful. Surprisingly, the KL-divergence is a poor regularizer in this case, as
its empirical value tends to be overweighted in comparison with the empirical value of the
C-bound (i.e., $C_Q^S$).

There have already been some attempts to develop PAC-Bayesian bounds that do not
rely on the KL-divergence (see the localized priors of Catoni, 2007, or the distribution-dependent
priors of Lever et al., 2013). The usual idea is to bound the KL-divergence via
some concentration inequality. In the following, the KL term simply vanishes from the
bound, provided that we restrict ourselves to aligned posteriors, a notion that is properly
defined later on in this section. The fact that these new PAC-Bayesian bounds do not
contain any KL divergence terms indicates that the restriction to aligned posteriors has
some “built in” regularization action.

The following theory is similar to the one used by Germain et al. (2011), in which two
learning algorithms inspired by the PAC-Bayesian theory are compared: one regularized
with the KL divergence, using a hyperparameter to control its weight, and one regularized by
restricting the posterior distributions to be aligned on the prior distribution. Surprisingly,
the latter algorithm uses one less parameter, and has been shown to be just as
accurate.

6.1 Self-Complemented Sets of Voters and Aligned Distributions 

In this section, we assume that the (possibly infinite) set of voters $\mathcal{H}$ is self-complemented 11.

Definition 29 A set of voters $\mathcal{H}$ is said to be self-complemented if there exists a bijection
$c : \mathcal{H} \to \mathcal{H}$ such that for any $f \in \mathcal{H}$,

$$c(f) = -f.$$

Moreover, we say that a distribution $Q$ on any self-complemented $\mathcal{H}$ is aligned on a prior
distribution $P$ if

$$Q(f) + Q(c(f)) = P(f) + P(c(f)), \quad \forall f \in \mathcal{H}.$$

When $P$ is the uniform prior distribution and $Q$ is aligned on $P$, we say that $Q$ is
quasi-uniform. Note that the uniform distribution is itself a quasi-uniform distribution.

In the finite case, we consider self-complemented sets $\mathcal{H}$ of $2n$ voters $\mathcal{X} \to \mathcal{Y}$. In this
setting, for any $x \in \mathcal{X}$ and any $i \in \{1,\ldots,n\}$, we have that $f_{i+n}(x) = -f_i(x)$. Moreover,
a finite quasi-uniform distribution $Q$ is such that for any $i \in \{1,\ldots,n\}$,

$$Q(f_i) + Q(f_{i+n}) = \frac{1}{n}. \qquad (37)$$

Equation (37) shows that when a distribution $Q$ is restricted to being quasi-uniform,
the sum of the weight given to a pair of complementary voters is equal to $\frac{1}{n}$. As $Q$ is a
distribution, this means that the weight of any voter is lower-bounded by 0 and upper-bounded
by $\frac{1}{n}$, giving rise to an $L_\infty$-norm regularization. Note that, in this context, the
maximum value of $\mathrm{KL}(Q\|P)$ is reached when all voters have a weight of either 0 or $\frac{1}{n}$.
Indeed, a quasi-uniform distribution $Q$ is such that $\mathrm{KL}(Q\|P) \le n\,\big(\frac{1}{n}\big)\ln\big(\frac{1/n}{1/(2n)}\big) = \ln 2$.
Consequently, the value of the KL term is necessarily small and plays only a minor role in
PAC-Bayesian bounds computed with quasi-uniform distributions. The following theorems and
corollaries are specializations that slightly improve these PAC-Bayesian bounds
by getting rid of the KL term completely. To achieve these results, the associated proofs
require restrictions on the choice of convex function $\mathcal{D}$ and loss function $\mathcal{L}$.
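The $\ln 2$ ceiling on the KL term can be illustrated with a small numerical experiment. In the sketch below, the number of voter pairs `n`, the random seed, and the helper names are arbitrary choices made for this example; the voters themselves are not needed, only their weights.

```python
import math
import random


def kl_divergence(q, p):
    """KL(q||p) for discrete distributions, with the convention 0 ln 0 = 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def random_quasi_uniform(n, rng):
    """Weights on 2n voters such that Q(f_i) + Q(f_{i+n}) = 1/n for every i."""
    first_half = [rng.uniform(0.0, 1.0 / n) for _ in range(n)]
    second_half = [1.0 / n - w for w in first_half]
    return first_half + second_half


n = 5
rng = random.Random(0)
uniform_prior = [1.0 / (2 * n)] * (2 * n)

Q = random_quasi_uniform(n, rng)
extreme_Q = [1.0 / n] * n + [0.0] * n  # all the mass on one voter of each pair

print(kl_divergence(Q, uniform_prior), kl_divergence(extreme_Q, uniform_prior), math.log(2))
```

The extreme distribution attains the value $\ln 2$, while any other quasi-uniform distribution stays below it.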

11. In Laviolette et al. (2011), this notion was introduced as an auto-complemented set of voters. However, 
self-complemented is a more suitable name. Also, note that a similar notion, called a symmetric hypothesis 
class, is introduced in Daniely et al. (2013). 
6.2 PAC-Bayesian Theorems without KL for the Gibbs Risk

Let us first specialize Theorem 18 to aligned distributions and linear loss $\mathcal{L}_\ell$. We first need
a new change of measure inequality, as this is the part of Theorem 18 where the KL term
appears.


Lemma 30 (Change of measure inequality for aligned posteriors)

For any self-complemented set $\mathcal{H}$, for any distribution $P$ on $\mathcal{H}$, any distribution $Q$ aligned
on $P$, and for any measurable function $\phi : \mathcal{H} \to \mathbb{R}$ such that $\phi(f) = \phi(c(f))$ for all $f \in \mathcal{H}$,
we have

$$\mathbb{E}_{f\sim Q}\, \phi(f) \le \ln\left(\mathbb{E}_{f\sim P}\, e^{\phi(f)}\right).$$

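Proof First, note that one can change the expectation over $Q$ to an expectation over $P$,
using the fact that $\phi(f) = \phi(c(f))$ for any $f \in \mathcal{H}$, and that $Q$ is aligned on $P$.

$$\begin{aligned}
2\cdot\mathbb{E}_{f\sim Q}\, \phi(f)
&= \int_{\mathcal{H}} df\, Q(f)\,\phi(f) + \int_{\mathcal{H}} df\, Q(c(f))\,\phi(c(f)) \\
&= \int_{\mathcal{H}} df\, Q(f)\,\phi(f) + \int_{\mathcal{H}} df\, Q(c(f))\,\phi(f) \\
&= \int_{\mathcal{H}} df\, \big(Q(f) + Q(c(f))\big)\,\phi(f) \\
&= \int_{\mathcal{H}} df\, \big(P(f) + P(c(f))\big)\,\phi(f) \\
&= \int_{\mathcal{H}} df\, P(f)\,\phi(f) + \int_{\mathcal{H}} df\, P(c(f))\,\phi(f) \\
&= \int_{\mathcal{H}} df\, P(f)\,\phi(f) + \int_{\mathcal{H}} df\, P(c(f))\,\phi(c(f)) \\
&= 2\cdot\mathbb{E}_{f\sim P}\, \phi(f).
\end{aligned}$$

The result is obtained by changing the expectation over $Q$ to an expectation over $P$, and
then by applying Jensen’s inequality (Lemma 47, in Appendix A).

$$\mathbb{E}_{f\sim Q}\, \phi(f) = \mathbb{E}_{f\sim P}\, \phi(f) = \mathbb{E}_{f\sim P}\, \ln e^{\phi(f)} \le \ln\left(\mathbb{E}_{f\sim P}\, e^{\phi(f)}\right). \qquad ■$$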
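A quick numerical illustration of Lemma 30 on a finite self-complemented set may help. In the sketch below, the prior, the aligned posterior, and the symmetric function `phi` are all randomly generated stand-ins (assumptions of this example): voter `i + n` plays the role of $c(f_i)$.

```python
import math
import random

rng = random.Random(1)
n = 4  # voters f_1..f_n and their complements f_{n+1}..f_{2n}

# Hypothetical prior P on the 2n voters (any distribution works here).
raw = [rng.random() for _ in range(2 * n)]
P = [w / sum(raw) for w in raw]

# Posterior Q aligned on P: each complementary pair keeps its total prior mass,
# but the mass is split differently inside the pair.
split = [rng.random() for _ in range(n)]
Q = [split[i] * (P[i] + P[i + n]) for i in range(n)]
Q += [(1 - split[i]) * (P[i] + P[i + n]) for i in range(n)]

# Any phi with phi(f) = phi(c(f)): the same value on both voters of a pair.
pair_values = [rng.uniform(-2, 2) for _ in range(n)]
phi = pair_values + pair_values

exp_Q = sum(q * v for q, v in zip(Q, phi))
exp_P = sum(p * v for p, v in zip(P, phi))
log_mgf_P = math.log(sum(p * math.exp(v) for p, v in zip(P, phi)))

print(exp_Q, exp_P, log_mgf_P)  # exp_Q equals exp_P, and both are below log_mgf_P
```

The equality of the two expectations is what removes the KL term: no information about how $Q$ differs from $P$ inside a pair is needed.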


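Theorem 31 (PAC-Bayesian theorem for aligned posteriors) For any distribution
$D$ on $\mathcal{X}\times\{-1,1\}$, any self-complemented set $\mathcal{H}$ of voters $\mathcal{X}\to[-1,1]$, any prior distribution
$P$ on $\mathcal{H}$, any convex function $\mathcal{D} : [0,1]\times[0,1] \to \mathbb{R}$ for which $\mathcal{D}(q,p) = \mathcal{D}(1-q, 1-p)$, for
any $m' > 0$ and any $\delta\in(0,1]$, we have

$$\Pr_{S\sim D^m}\left(\begin{array}{l}
\text{For all posteriors } Q \text{ aligned on } P: \\[2pt]
\mathcal{D}\big(R_S(G_Q), R_D(G_Q)\big) \le \dfrac{1}{m'}\left[\ln\left(\dfrac{1}{\delta}\, \mathbb{E}_{S\sim D^m}\, \mathbb{E}_{f\sim P}\, e^{\,m'\cdot\mathcal{D}\left(\mathbb{E}^{\mathcal{L}_\ell}_{S}(f),\ \mathbb{E}^{\mathcal{L}_\ell}_{D}(f)\right)}\right)\right]
\end{array}\right) \ge 1-\delta.$$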
Similarly to Theorem 18, the statement of Theorem 31 above contains a value $m'$ which
is likely to be set to $m$ in most cases. However, the distinction between $m$ and $m'$ is
mandatory to develop the PAC-Bayesian theory for sample-compressed voters in Section 7.
Indeed, in proofs of forthcoming Theorems 39, 41 and 42, we have $m' = m - \lambda$, where $\lambda$
is the size of the voters compression sequence (this concept is properly defined in Section 7).

Proof The proof follows the exact same steps as the proof of Theorem 18, using the
linear loss $\mathcal{L} = \mathcal{L}_\ell$ and replacing the use of the change of measure inequality (Lemma 17)
by the change of measure inequality for aligned posteriors (Lemma 30), with
$\phi(f) = m'\cdot\mathcal{D}\big(\mathbb{E}^{\mathcal{L}_\ell}_{S}(f), \mathbb{E}^{\mathcal{L}_\ell}_{D}(f)\big)$. Note that this function has the required property, as

$$\mathcal{D}\big(\mathbb{E}^{\mathcal{L}_\ell}_{S}(f), \mathbb{E}^{\mathcal{L}_\ell}_{D}(f)\big) = \mathcal{D}\big(1 - \mathbb{E}^{\mathcal{L}_\ell}_{S}(c(f)),\ 1 - \mathbb{E}^{\mathcal{L}_\ell}_{D}(c(f))\big) = \mathcal{D}\big(\mathbb{E}^{\mathcal{L}_\ell}_{S}(c(f)), \mathbb{E}^{\mathcal{L}_\ell}_{D}(c(f))\big).$$

The other steps of the proof stay exactly the same as the proof of Theorem 18. ■

Appendix B presents more general versions of the last two results.

Let us specialize Theorem 31 to the case where $\mathcal{D}(q,p) = \mathrm{kl}(q\|p)$. Doing so, we recover
the classical PAC-Bayesian theorem (Theorem 20), but for aligned posteriors, which
therefore has no KL term.

Corollary 32 For any distribution $D$ on $\mathcal{X}\times\{-1,1\}$, any prior distribution $P$ on a
self-complemented set $\mathcal{H}$ of voters $\mathcal{X}\to[-1,1]$, and any $\delta\in(0,1]$, we have

$$\Pr_{S\sim D^m}\left(\text{For all posteriors } Q \text{ aligned on } P:\ \mathrm{kl}\big(R_S(G_Q) \,\big\|\, R_D(G_Q)\big) \le \frac{1}{m}\left[\ln\frac{\xi(m)}{\delta}\right]\right) \ge 1-\delta,$$

where $\mathrm{kl}(q\|p)$ and $\xi(m)$ are defined by Equations (21) and (22) respectively.

Proof This result follows from Theorem 31 by choosing $\mathcal{D}(q,p) = \mathrm{kl}(q,p)$ and $m' = m$.
The rest of the proof relies on Lemma 19 (as for the proof of Theorem 20). ■


The following corollary is very similar to the original PAC-Bayesian bound of McAllester 
(2003a), but without the KL term. 


Corollary 33 For any distribution $D$ on $\mathcal{X}\times\{-1,1\}$, any self-complemented set $\mathcal{H}$ of
voters $\mathcal{X}\to[-1,1]$, any prior distribution $P$ on $\mathcal{H}$, and any $\delta\in(0,1]$, we have

$$\Pr_{S\sim D^m}\left(\text{For all posteriors } Q \text{ aligned on } P:\ R_D(G_Q) \le R_S(G_Q) + \sqrt{\frac{1}{2m}\left[\ln\frac{\xi(m)}{\delta}\right]}\right) \ge 1-\delta.$$

Proof The result is derived from Corollary 32, by using $2(q - p)^2 \le \mathrm{kl}(q\|p)$ (Pinsker’s
inequality), and isolating $R_D(G_Q)$ in the obtained inequality. ■
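Corollaries 32 and 33 give two KL-free upper bounds on the Gibbs risk, and since Pinsker's inequality only relaxes the kl constraint, the kl-based one can never be looser. The sketch below compares them on made-up inputs; the function names and the numerical values are assumptions of this example, and the kl bound is inverted by simple bisection.

```python
import math


def xi(m):
    """xi(m) = sum_k C(m,k) (k/m)^k (1-k/m)^(m-k), evaluated in log-space."""
    total = 2.0  # the k = 0 and k = m terms are both equal to 1
    for k in range(1, m):
        log_term = (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
                    + k * math.log(k / m) + (m - k) * math.log(1 - k / m))
        total += math.exp(log_term)
    return total


def kl_bernoulli(q, p):
    """Binary kl(q||p), with the convention 0 ln 0 = 0."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    out = 0.0
    if q > 0:
        out += q * math.log(q / p)
    if q < 1:
        out += (1 - q) * math.log((1 - q) / (1 - p))
    return out


def gibbs_bound_kl(risk_S, m, delta, tol=1e-10):
    """Largest r >= risk_S with kl(risk_S || r) <= ln(xi(m)/delta)/m (Corollary 32)."""
    rhs = math.log(xi(m) / delta) / m
    lo, hi = risk_S, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(risk_S, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo


def gibbs_bound_pinsker(risk_S, m, delta):
    """Closed-form bound of Corollary 33."""
    return risk_S + math.sqrt(math.log(xi(m) / delta) / (2 * m))


# Illustrative inputs: empirical Gibbs risk 0.30 on m = 1000 examples, delta = 0.05.
b_kl = gibbs_bound_kl(0.30, 1000, 0.05)
b_pinsker = gibbs_bound_pinsker(0.30, 1000, 0.05)
print(b_kl, b_pinsker)
```

The closed form of Corollary 33 is what makes the bound convenient inside an optimization problem, at the cost of a small loss in tightness.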
Unlike Theorem 18, Theorem 31 cannot straightforwardly be used for pairs of voters, as
we did in the proof of Theorem 25. The reason is that a posterior distribution that is the 
result of the product of two aligned posteriors is not necessarily aligned itself. So, we have 
to ensure that we can get rid of the KL term even in that case. 


6.3 PAC-Bayesian Theorems without KL for the Expected Disagreement $d_Q^D$

The following theorem is similar to Theorem 31 for aligned posteriors, but deals with
paired-voters. Instead of the linear loss $\mathcal{L}_\ell$, we use the loss $\mathcal{L}_d$ of Equation (27), which is a linear
loss defined on a pair of voters. Again, the next two results can be seen as a particular case
of the two theorems from Appendix B.

In this subsection, we use the following shorthand notation. Given $f_{ij} = (f_i, f_j)$ as
defined in Definition 24, the voters $f_{i^c j}$, $f_{i j^c}$ and $f_{i^c j^c}$ are defined as

$$f_{i^c j}(x) \overset{\text{def}}{=} \big(c(f_i)(x), f_j(x)\big), \quad f_{i j^c}(x) \overset{\text{def}}{=} \big(f_i(x), c(f_j)(x)\big), \quad \text{and} \quad f_{i^c j^c}(x) \overset{\text{def}}{=} \big(c(f_i)(x), c(f_j)(x)\big).$$

Recall that from Equation (26), we have $\mathcal{H}^2 \overset{\text{def}}{=} \{f_{ij} : f_i, f_j \in \mathcal{H}\}$ and $Q^2(f_{ij}) \overset{\text{def}}{=}
Q(f_i)\cdot Q(f_j)$. Similarly, we define $P^2(f_{ij}) \overset{\text{def}}{=} P(f_i)\cdot P(f_j)$. Using this notation, let us first
generalize the change of measure inequality of Lemma 30 to paired-voters.


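Lemma 34 (Change of measure inequality for paired-voters and aligned posteriors)
For any self-complemented set $\mathcal{H}$, for any distribution $P$ on $\mathcal{H}$, any distribution $Q$
aligned on $P$, and for any measurable function $\phi : \mathcal{H}^2 \to \mathbb{R}$ such that $\phi(f_{ij}) = \phi(f_{i^c j}) =
\phi(f_{i j^c}) = \phi(f_{i^c j^c})$ for all $f_{ij} \in \mathcal{H}^2$, we have

$$\mathbb{E}_{f_{ij}\sim Q^2}\, \phi(f_{ij}) \le \ln\left(\mathbb{E}_{f_{ij}\sim P^2}\, e^{\phi(f_{ij})}\right).$$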

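Proof First, note that one can change the expectation over $Q^2$ to an expectation over $P^2$,
using the fact that $\phi(f_{ij}) = \phi(f_{i^c j}) = \phi(f_{i j^c}) = \phi(f_{i^c j^c})$ for any $f_{ij} \in \mathcal{H}^2$, and that $Q$ is
aligned on $P$. More specifically, we have the following.

$$\begin{aligned}
4\cdot\mathbb{E}_{f_{ij}\sim Q^2}\, \phi(f_{ij})
&= \int_{\mathcal{H}^2} df_{ij}\, Q^2(f_{ij})\,\phi(f_{ij}) + \int_{\mathcal{H}^2} df_{ij}\, Q^2(f_{i^c j})\,\phi(f_{i^c j}) + \int_{\mathcal{H}^2} df_{ij}\, Q^2(f_{i j^c})\,\phi(f_{i j^c}) + \int_{\mathcal{H}^2} df_{ij}\, Q^2(f_{i^c j^c})\,\phi(f_{i^c j^c}) \\
&= \int_{\mathcal{H}^2} df_{ij}\, Q^2(f_{ij})\,\phi(f_{ij}) + \int_{\mathcal{H}^2} df_{ij}\, Q^2(f_{i^c j})\,\phi(f_{ij}) + \int_{\mathcal{H}^2} df_{ij}\, Q^2(f_{i j^c})\,\phi(f_{ij}) + \int_{\mathcal{H}^2} df_{ij}\, Q^2(f_{i^c j^c})\,\phi(f_{ij}) \\
&= \int_{\mathcal{H}^2} df_{ij}\, \big(Q^2(f_{ij}) + Q^2(f_{i^c j}) + Q^2(f_{i j^c}) + Q^2(f_{i^c j^c})\big)\,\phi(f_{ij}) \\
&= \int_{\mathcal{H}^2} df_{ij}\, \big(P^2(f_{ij}) + P^2(f_{i^c j}) + P^2(f_{i j^c}) + P^2(f_{i^c j^c})\big)\,\phi(f_{ij}) \\
&= 4\cdot\mathbb{E}_{f_{ij}\sim P^2}\, \phi(f_{ij}).
\end{aligned}$$

The result is then obtained by changing the expectation over $Q^2$ to an expectation over $P^2$,
and then by applying Jensen’s inequality (Lemma 47, in Appendix A).

$$\mathbb{E}_{f_{ij}\sim Q^2}\, \phi(f_{ij}) = \mathbb{E}_{f_{ij}\sim P^2}\, \phi(f_{ij}) = \mathbb{E}_{f_{ij}\sim P^2}\, \ln e^{\phi(f_{ij})} \le \ln\left(\mathbb{E}_{f_{ij}\sim P^2}\, e^{\phi(f_{ij})}\right). \qquad ■$$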

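Theorem 35 (PAC-Bayesian theorem for paired-voters and aligned posteriors)

For any distribution $D$ on $\mathcal{X}\times\{-1,1\}$, any self-complemented set $\mathcal{H}$ of voters $\mathcal{X}\to[-1,1]$,
any prior distribution $P$ on $\mathcal{H}$, any convex function $\mathcal{D} : [0,1]\times[0,1] \to \mathbb{R}$ for which
$\mathcal{D}(q,p) = \mathcal{D}(1-q, 1-p)$, for any $m' > 0$ and any $\delta\in(0,1]$, we have

$$\Pr_{S\sim D^m}\left(\begin{array}{l}
\text{For all posteriors } Q \text{ aligned on } P: \\[2pt]
\mathcal{D}\big(d_Q^S, d_Q^D\big) \le \dfrac{1}{m'}\left[\ln\left(\dfrac{1}{\delta}\, \mathbb{E}_{S\sim D^m}\, \mathbb{E}_{f_{ij}\sim P^2}\, e^{\,m'\cdot\mathcal{D}\left(\mathbb{E}^{\mathcal{L}_d}_{S}(f_{ij}),\ \mathbb{E}^{\mathcal{L}_d}_{D}(f_{ij})\right)}\right)\right]
\end{array}\right) \ge 1-\delta,$$

where $f_{ij}$ is given in Definition 24, and where $P^2(f_{ij}) = P(f_i)\cdot P(f_j)$.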


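Proof Theorem 35 is deduced from Theorem 31, by using the change of measure inequality
given by Lemma 34 instead of the one from Lemma 30, with $\phi(f_{ij}) = m'\cdot\mathcal{D}\big(\mathbb{E}^{\mathcal{L}_d}_{S}(f_{ij}), \mathbb{E}^{\mathcal{L}_d}_{D}(f_{ij})\big)$.
As the loss $\mathcal{L}_d$ is such that

$$\mathbb{E}^{\mathcal{L}_d}_{D'}(f_{ij}) = \mathbb{E}^{\mathcal{L}_d}_{D'}(f_{i^c j^c}), \quad \text{and} \quad \mathbb{E}^{\mathcal{L}_d}_{D'}(f_{i^c j}) = \mathbb{E}^{\mathcal{L}_d}_{D'}(f_{i j^c}) = 1 - \mathbb{E}^{\mathcal{L}_d}_{D'}(f_{ij}),$$

we then have that $\phi(f_{ij})$ has the required property to apply Lemma 34. ■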

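Let us now specialize Theorem 35 to $\mathcal{D}(q,p) = \mathrm{kl}(q\|p)$.

Corollary 36 For any distribution $D$ on $\mathcal{X} \times \{-1,1\}$, any self-complemented set
$\mathcal{H}$ of voters $\mathcal{X} \to [-1,1]$, any prior distribution $P$ on $\mathcal{H}$, and any
$\delta \in (0,1]$, we have

$$\Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q \text{ aligned on } P : \\ \mathrm{kl}(d_Q^S \,\|\, d_Q^D) \;\le\; \dfrac{1}{m}\left[ \ln\dfrac{\xi(m)}{\delta} \right] \end{array} \right) \;\ge\; 1-\delta .$$

Proof The result is directly obtained from Theorem 35, by choosing $\mathcal{D}(q,p) = \mathrm{kl}(q\|p)$.
The rest of the proof relies on Lemma 19. ■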


Similarly as for Corollary 33, we can easily derive the following result. 

Corollary 37 For any distribution $D$ on $\mathcal{X} \times \{-1,1\}$, for any self-complemented
set $\mathcal{H}$ of voters $\mathcal{X} \to [-1,1]$, any prior distribution $P$ on $\mathcal{H}$, and
any $\delta \in (0,1]$, we have

$$\Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q \text{ aligned on } P : \\ d_Q^D \;\ge\; d_Q^S - \sqrt{\dfrac{1}{2m}\left[ \ln\dfrac{\xi(m)}{\delta} \right]} \end{array} \right) \;\ge\; 1-\delta .$$

Proof The result is derived from Corollary 36, by using $2(q-p)^2 \le \mathrm{kl}(q\|p)$ (Pinsker's
inequality), and isolating $d_Q^D$ in the obtained inequality. ■
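
The isolation step of this proof can be spelled out as follows. This is only a sketch: it assumes that the right-hand side of Corollary 36 is $\frac{1}{m}\ln\frac{\xi(m)}{\delta}$, with $\xi$ the function of Equation (22), and it holds on the event of probability at least $1-\delta$ given by that corollary.

```latex
% Pinsker's inequality 2 (q - p)^2 <= kl(q || p), applied with q = d_Q^S and p = d_Q^D
\begin{align*}
2\,\bigl(d_Q^S - d_Q^D\bigr)^2 \;\le\; \mathrm{kl}\bigl(d_Q^S \,\|\, d_Q^D\bigr)
  \;\le\; \frac{1}{m}\ln\frac{\xi(m)}{\delta}
&\;\Longrightarrow\;
  \bigl|d_Q^S - d_Q^D\bigr| \;\le\; \sqrt{\frac{1}{2m}\ln\frac{\xi(m)}{\delta}} \\
&\;\Longrightarrow\;
  d_Q^D \;\ge\; d_Q^S - \sqrt{\frac{1}{2m}\ln\frac{\xi(m)}{\delta}} .
\end{align*}
```

Only the lower side of the absolute value is kept, because a lower bound on the expected disagreement is what tightens the C-bound.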


6.4 A Bound for the Risk of the Majority Vote without KL Term 

Finally, we make use of these results to bound $\mathcal{C}_Q^D$ - and therefore $R_D(B_Q)$ - for aligned
posteriors $Q$, giving rise to PAC-Bound 3. Aside from the fact that this bound has no KL
term, it is similar to PAC-Bound 1, as it separately bounds the Gibbs risk and the expected
disagreement. This new PAC-Bayesian bound provides us with a starting point to design
the MinCq learning algorithm introduced in Section 8.


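PAC-Bound 3 For any distribution $D$ on $\mathcal{X} \times \{-1,1\}$, for any self-complemented set
$\mathcal{H}$ of voters $\mathcal{X} \to [-1,1]$, for any prior distribution $P$ on $\mathcal{H}$, and any
$\delta \in (0,1]$, we have

$$\Pr_{S\sim D^m}\left( \forall\, Q \text{ aligned on } P : \quad R_D(B_Q) \;\le\; 1 - \frac{\big(1 - 2\cdot\overline{r}\big)^2}{1 - 2\cdot\underline{d}} \;=\; 1 - \frac{\big(\underline{\mu_1}\big)^2}{\overline{\mu_2}} \right) \;\ge\; 1-\delta ,$$

where

$$\begin{aligned}
\overline{r} &\overset{\text{def}}{=} \min\left( \tfrac{1}{2},\; R_S(G_Q) + \sqrt{\tfrac{1}{2m}\left[ \ln\tfrac{\xi(m)}{\delta/2} \right]} \right), &
\underline{d} &\overset{\text{def}}{=} \max\left( 0,\; d_Q^S - \sqrt{\tfrac{1}{2m}\left[ \ln\tfrac{\xi(m)}{\delta/2} \right]} \right), \\
\underline{\mu_1} &\overset{\text{def}}{=} \max\left( 0,\; \mu_1(M_Q^S) - \sqrt{\tfrac{2}{m}\left[ \ln\tfrac{\xi(m)}{\delta/2} \right]} \right), &
\overline{\mu_2} &\overset{\text{def}}{=} \min\left( 1,\; \mu_2(M_Q^S) + \sqrt{\tfrac{2}{m}\left[ \ln\tfrac{\xi(m)}{\delta/2} \right]} \right).
\end{aligned}$$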


Proof The inequality is a consequence of Theorem 11, as well as Corollaries 33 and 37.
The equality $1 - \frac{(1 - 2\cdot\overline{r})^2}{1 - 2\cdot\underline{d}} = 1 - \frac{(\underline{\mu_1})^2}{\overline{\mu_2}}$
is a direct application of Equations (7) and (9). ■
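
To make the bound concrete, the following sketch evaluates the first form of PAC-Bound 3 from an empirical Gibbs risk and an empirical expected disagreement. It assumes that $\xi(m)$ of Equation (22) is the sum $\sum_{k=0}^{m} \binom{m}{k} (k/m)^k (1-k/m)^{m-k}$; the function names `xi` and `pac_bound_3` are illustrative and are not taken from the authors' released code.

```python
import math


def xi(m: int) -> float:
    """xi(m) = sum_k C(m, k) (k/m)^k (1 - k/m)^(m - k), summed in log space.

    Assumption: this is the function referred to as Equation (22).
    """
    total = 0.0
    for k in range(m + 1):
        # log of the binomial coefficient C(m, k)
        log_term = math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
        if k > 0:
            log_term += k * math.log(k / m)
        if k < m:
            log_term += (m - k) * math.log(1 - k / m)
        total += math.exp(log_term)
    return total


def pac_bound_3(gibbs_risk: float, disagreement: float, m: int, delta: float) -> float:
    """Upper bound on R_D(B_Q) for a posterior Q aligned on the prior P."""
    eps = math.sqrt(math.log(xi(m) / (delta / 2)) / (2 * m))
    r_bar = min(0.5, gibbs_risk + eps)      # upper bound on the true Gibbs risk
    d_low = max(0.0, disagreement - eps)    # lower bound on the true disagreement
    return 1.0 - (1.0 - 2.0 * r_bar) ** 2 / (1.0 - 2.0 * d_low)


if __name__ == "__main__":
    print(pac_bound_3(gibbs_risk=0.30, disagreement=0.35, m=1000, delta=0.05))
```

The confidence parameter is split as $\delta/2$ between the two estimates, which is why the same deviation term appears in both $\overline{r}$ and $\underline{d}$.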


PAC-Bound 3’, presented at the end of Section 7, accepts voters that are kernel
functions defined using a part of the training set $S$. This is unusual in PAC-Bayesian
theory, since the prior $P$ on the set of voters has to be defined before seeing the training
set $S$. To overcome this difficulty, we use sample compression theory.


7. PAC-Bayesian Theory for Sample-Compressed Voters 

PAC-Bayesian theorems of Sections 5 and 6 are not valid when $\mathcal{H}$ consists of a set of
functions of the form $k(x_i, \cdot)$ for some kernel $k : \mathcal{X} \times \mathcal{X} \to [-1,1]$, as is the
case with the Support Vector Machine classifier (see Equation 1). This is because the definition
of each involved voter depends on an example $(x_i, y_i)$ of the training data $S$. This is problematic
from the PAC-Bayesian point of view because the prior on the voters is supposed to be
defined before seeing the data $S$. There are two known methods to overcome this problem.

The first method, introduced by Langford and Shawe-Taylor (2002), considers a surrogate
set of voters $\mathcal{H}^k$ of all the linear classifiers in the space induced$^{12}$ by the kernel $k$. They
then make use of the representer theorem to show that the classification function turns out
to be a linear combination of the examples, similar to the Support Vector Machine classifier
(Equation 1). To avoid the curse of dimensionality, they propose restricting the choice
of the prior and posterior distributions on $\mathcal{H}^k$ to isotropic Gaussians centered on a vector
representing a particular linear classifier. Based on this approach, Germain et al. (2009)
suggest a learning algorithm for linear classifiers that consists exactly in minimizing a
PAC-Bayesian bound.

12. This space is also known as a Reproducing Kernel Hilbert Space (RKHS). For more details, see Cristianini
and Shawe-Taylor (2000) and Schölkopf et al. (2001).

The second method, which is presented in this section, is based on the sample
compression setting of Floyd and Warmuth (1995). It has been adapted to the PAC-Bayesian
theory by Laviolette and Marchand (2005, 2007), allowing one to directly deal
with the case where voters are constructed using examples in the training set, without
involving any RKHS notion or any representer theorem. In contrast to the first method
described above, the sample compression approach allows one to deal not only with kernel
functions, but with any kind of similarity measure between examples, and hence with
any kind of voters.

7.1 The General Sample Compression Setting 

In the sample compression setting, learning algorithms have access to a data-dependent set of
voters, that we refer to as sc-voters. Given a training sequence$^{13}$ $S = ( (x_1, y_1), \ldots, (x_m, y_m) )$,
each sc-voter is described by a sequence $S_{\mathbf{i}}$ of elements of $S$ called the compression sequence,
and a message $\sigma$ which represents the additional information needed to obtain a voter from
$S_{\mathbf{i}}$. If $\mathbf{i} = (i_1, i_2, \ldots, i_k)$, then
$S_{\mathbf{i}} \overset{\text{def}}{=} ( (x_{i_1}, y_{i_1}), (x_{i_2}, y_{i_2}), \ldots, (x_{i_k}, y_{i_k}) )$.
In this paper, repetitions are allowed in $S_{\mathbf{i}}$, and $k$, the number of indices present in
$\mathbf{i}$ (counting the repetitions), is denoted by $|\mathbf{i}|$.

The fact that each sc-voter is described by a compression sequence and a message implies
that there exists a reconstruction function $\mathcal{R}(S_{\mathbf{i}}, \sigma)$ that outputs a classifier when given an
arbitrary compression sequence $S_{\mathbf{i}}$ and a message $\sigma$. The message $\sigma$ is chosen from the
set $\Sigma_{S_{\mathbf{i}}}$ of all messages that can be supplied with the compression sequence $S_{\mathbf{i}}$. In the
PAC-Bayesian setting, $\Sigma_{S_{\mathbf{i}}}$ must be defined a priori (before observing the training data)
for all possible sequences $S_{\mathbf{i}}$, and can be either a discrete or a continuous set. The sample
compression setting strictly generalizes the (classical) non-sample-compressed setting, since
the latter corresponds to the case where $|\mathbf{i}| = 0$, the voters being then defined only via the
messages.

7.2 A Simplified Sample Compression Setting 

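For the needs of this paper, we consider a simplified framework where sc-voters have a
compression sequence of at most $\lambda$ examples (possibly with repetitions) and a message
string of $\lambda$ bits that we represent by a sequence of “−1” and “+1”. Instead of being defined
on sc-voters, the weighted distribution $Q$ is defined on $\mathcal{I}_\lambda \times \Sigma_\lambda$, where

$$\mathcal{I}_\lambda \overset{\text{def}}{=} \big\{ (i_1, \ldots, i_k) \,:\, k \in \{0, \ldots, \lambda\} \text{ and } i_j \in \{1, \ldots, m\} \big\} \quad \text{and} \quad \Sigma_\lambda \overset{\text{def}}{=} \{-1, 1\}^\lambda . \tag{38}$$

13. The sample compression theory considers the training examples as a sequence instead of a set, because
it refers to the training examples by their indices.

In other words, $Q(\mathbf{i}, \sigma)$ corresponds to the weight of the sc-voter output by $\mathcal{R}(S_{\mathbf{i}}, \sigma)$, i.e., the
sc-voter of compression sequence $\mathbf{i} = (i_1, \ldots, i_{|\mathbf{i}|}) \in \mathcal{I}_\lambda$ and message
$\sigma = (\sigma_1, \ldots, \sigma_\lambda) \in \Sigma_\lambda$.
In particular, a prior (resp., a posterior) on the set of all sc-voters is now simply a prior
on the set $\mathcal{I}_\lambda \times \Sigma_\lambda$. Thus, such a prior can really be defined a priori, before seeing the
data $S$.$^{14}$ The set of sc-voters is therefore only defined when the training sequence $S$ is
given, and corresponds to

$$\mathcal{H}^{\mathcal{R}}_{S,\lambda} \overset{\text{def}}{=} \big\{ \mathcal{R}(S_{\mathbf{i}}, \sigma) \,:\, \mathbf{i} \in \mathcal{I}_\lambda,\ \sigma \in \Sigma_\lambda \big\} .$$

Finally, given a training sequence $S$ and a reconstruction function $\mathcal{R}$, for a distribution $Q$
on $\mathcal{I}_\lambda \times \Sigma_\lambda$, we define the Bayes classifier as

$$B_{Q,S}(x) \overset{\text{def}}{=} \mathrm{sgn}\Big[ \mathbb{E}_{(\mathbf{i},\sigma)\sim Q}\, \mathcal{R}(S_{\mathbf{i}}, \sigma)(x) \Big] .$$

We then define the Bayes risk $R_{D'}(B_{Q,S})$ and the Gibbs risk $R_{D'}(G_{Q,S})$ of a distribution $Q$
on $\mathcal{I}_\lambda \times \Sigma_\lambda$ relative to $D'$ as

$$R_{D'}(B_{Q,S}) \overset{\text{def}}{=} \mathbb{E}^{\mathcal{L}_{01}}_{D'}(B_{Q,S}) ,$$

$$R_{D'}(G_{Q,S}) \overset{\text{def}}{=} \mathbb{E}_{(\mathbf{i},\sigma)\sim Q}\, \mathbb{E}^{\mathcal{L}_\ell}_{D'}\big(\mathcal{R}(S_{\mathbf{i}}, \sigma)\big) .$$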

7.3 A First Sample-Compressed PAC-Bayesian Theorem 

To derive PAC-Bayesian bounds for majority votes of sc-voters, one must deal with the
following issue: even if the training sequence $S$ is drawn i.i.d. from a data-generating
distribution $D$, the empirical Gibbs risk $R_S(G_{Q,S})$ is not an unbiased estimate of the true
risk $R_D(G_{Q,S})$. For instance, the reconstruction function $\mathcal{R}$ can be such that an sc-voter
output by $\mathcal{R}(S_{\mathbf{i}}, \sigma)$ never errs on an example belonging to its compression sequence $S_{\mathbf{i}}$; this
biases the empirical risk because examples of $S_{\mathbf{i}}$ are all in $S$.

To deal with this bias, the $\frac{1}{m}$ factor in the usual PAC-Bayesian bounds is replaced by a
factor of the form $\frac{1}{m-l}$ in their sample compression versions. In Laviolette and Marchand
(2005, 2007), $l$ corresponds to the $Q$-average size of the sample compression sequence. In
the present paper, we restrict ourselves to a simpler case, where $l$ is the maximum possible
size of a compression sequence (i.e., $l = \lambda$). This simplification allows us to deal with
the biased character of the empirical Gibbs risk using a proof approach similar to the one
proposed in Germain et al. (2011). The key step of this approach is summarized in the
following lemma.

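Lemma 38 Let $\mathcal{R}$ be a reconstruction function that outputs sc-voters of size at most $\lambda$
(where $\lambda < m$). For any distribution $D$ on $\mathcal{X} \times \{-1,1\}$, and for any prior distribution $P$
on $\mathcal{I}_\lambda \times \Sigma_\lambda$,

$$\mathbb{E}_{S\sim D^m}\, \mathbb{E}_{(\mathbf{i},\sigma)\sim P}\, e^{(m-\lambda)\cdot 2 \cdot \left( \mathbb{E}^{\mathcal{L}_\ell}_{S}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \right)^2} \;\le\; e^{4\lambda} \cdot \xi(m-\lambda) ,$$

where $\xi(\cdot)$ is defined by Equation (22), and therefore we have that $\xi(m-\lambda) \le 2\sqrt{m-\lambda}$.

14. Laviolette and Marchand (2007) describe a more general setting where, for each $S \in (\mathcal{X} \times \mathcal{Y})^m$, a prior
is defined on $\mathcal{I}_\lambda \times \Sigma_{S_{\mathbf{i}}}$. Hence, the messages may depend on the compression sequence $S_{\mathbf{i}}$.

Proof As the choice of $(\mathbf{i}, \sigma)$ according to the prior $P$ is independent$^{15}$ of $S$, we have

$$\begin{aligned}
\mathbb{E}_{S\sim D^m}\, \mathbb{E}_{(\mathbf{i},\sigma)\sim P}\, & e^{(m-\lambda)\cdot 2 \cdot \left( \mathbb{E}^{\mathcal{L}_\ell}_{S}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \right)^2} \\
&= \mathbb{E}_{(\mathbf{i},\sigma)\sim P}\, \mathbb{E}_{S\sim D^m}\, e^{(m-\lambda)\cdot 2 \cdot \left( \mathbb{E}^{\mathcal{L}_\ell}_{S}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \right)^2} && (39) \\
&= \mathbb{E}_{(\mathbf{i},\sigma)\sim P}\, \mathbb{E}_{S_{\mathbf{i}}\sim D^{\lambda}}\, \mathbb{E}_{S_{\mathbf{i}^c}\sim D^{m-\lambda}}\, e^{(m-\lambda)\cdot 2 \cdot \left( \mathbb{E}^{\mathcal{L}_\ell}_{S}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \right)^2} . && (40)
\end{aligned}$$

Let us now rewrite the empirical loss of an sc-voter as a combination of the loss on its
compression sequence $S_{\mathbf{i}}$ and the loss on the other training examples $S_{\mathbf{i}^c}$,

$$\mathbb{E}^{\mathcal{L}_\ell}_{S}(\mathcal{R}(S_{\mathbf{i}},\sigma)) = \frac{1}{m}\Big[ \lambda \cdot \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) + (m-\lambda) \cdot \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}^c}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \Big] .$$

Since $0 \le \mathbb{E}^{\mathcal{L}_\ell}_{D'}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \le 1$ and $2\cdot(q-p)^2 \le \mathrm{kl}(q\|p)$ (Pinsker's inequality), we have

$$\begin{aligned}
(m-\lambda) & \cdot 2 \cdot \Big( \mathbb{E}^{\mathcal{L}_\ell}_{S}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \Big)^2 \\
&= (m-\lambda)\cdot 2\cdot \Big( \tfrac{1}{m}\big[ \lambda\cdot\mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) + (m-\lambda)\cdot\mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}^c}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \big] - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \Big)^2 \\
&= (m-\lambda)\cdot 2\cdot \Big( \tfrac{\lambda}{m}\big[ \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}^c}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \big] + \big[ \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}^c}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \big] \Big)^2 \\
&= (m-\lambda)\cdot 2\cdot \Big( \big(\tfrac{\lambda}{m}\big)^2 \big[ \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}^c}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \big]^2 + \big[ \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}^c}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \big]^2 \\
&\qquad\qquad\qquad + \tfrac{2\lambda}{m} \big[ \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}^c}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \big] \big[ \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}^c}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \big] \Big) \\
&\le (m-\lambda)\cdot 2\cdot \Big( \big(\tfrac{\lambda}{m}\big)^2 + \big[ \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}^c}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \big]^2 + \tfrac{2\lambda}{m} \Big) \\
&= 2\lambda\cdot\Big( 2 - \tfrac{\lambda}{m} - \tfrac{\lambda^2}{m^2} \Big) + (m-\lambda)\cdot 2\cdot \big[ \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}^c}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \big]^2 \\
&\le 4\lambda + (m-\lambda)\cdot 2\cdot \big[ \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}^c}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \big]^2 \\
&\le 4\lambda + (m-\lambda)\cdot \mathrm{kl}\Big( \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}^c}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \,\Big\|\, \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \Big) . && (41)
\end{aligned}$$

Note that $\mathcal{R}(S_{\mathbf{i}}, \sigma)$ does not depend on examples contained in $S_{\mathbf{i}^c}$. Thus, from the point
of view of $S_{\mathbf{i}^c}$, $\mathcal{R}(S_{\mathbf{i}}, \sigma)$ is a classical voter (not a sample-compressed one). Therefore, one
can apply Lemma 19, replacing $S \sim D^m$ by $S_{\mathbf{i}^c} \sim D^{m-\lambda}$, and $f$ by $\mathcal{R}(S_{\mathbf{i}}, \sigma)$. Lemma 19,
together with Equations (40) and (41), gives

$$\begin{aligned}
\mathbb{E}_{S\sim D^m}\, \mathbb{E}_{(\mathbf{i},\sigma)\sim P}\, & e^{(m-\lambda)\cdot 2 \cdot \left( \mathbb{E}^{\mathcal{L}_\ell}_{S}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \right)^2} \\
&\le e^{4\lambda} \cdot \mathbb{E}_{(\mathbf{i},\sigma)\sim P}\, \mathbb{E}_{S_{\mathbf{i}}\sim D^{\lambda}}\, \mathbb{E}_{S_{\mathbf{i}^c}\sim D^{m-\lambda}}\, e^{(m-\lambda)\cdot \mathrm{kl}\left( \mathbb{E}^{\mathcal{L}_\ell}_{S_{\mathbf{i}^c}}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \,\|\, \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \right)} \\
&\le e^{4\lambda} \cdot \mathbb{E}_{(\mathbf{i},\sigma)\sim P}\, \mathbb{E}_{S_{\mathbf{i}}\sim D^{\lambda}}\, \xi(m-\lambda) \;=\; e^{4\lambda} \cdot \xi(m-\lambda) ,
\end{aligned}$$

and we are done. ■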

15. Note that because of this independence, the exchange in the order of the two expectations (Line 39) is
trivial. This independence is a direct consequence of our choice to only consider the simplified setting
described by Equation (38). In the more general setting of Laviolette and Marchand (2007), this part of
the proof is more complicated.

The next PAC-Bayesian theorem presents the generalization of McAllester’s PAC-Bayesian
bound (Corollary 22) for the sample compression case.


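Theorem 39 Let $\mathcal{R}$ be a reconstruction function that outputs sc-voters of size at most $\lambda$
(where $\lambda < m$). For any distribution $D$ on $\mathcal{X} \times \{-1,1\}$, for any prior distribution $P$ on
$\mathcal{I}_\lambda \times \Sigma_\lambda$, and any $\delta \in (0,1]$, we have

$$\Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q : \\ R_D(G_{Q,S}) \;\le\; R_S(G_{Q,S}) + \sqrt{\dfrac{1}{2(m-\lambda)}\left[ \mathrm{KL}(Q\|P) + 4\lambda + \ln\dfrac{\xi(m-\lambda)}{\delta} \right]} \end{array} \right) \;\ge\; 1-\delta .$$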


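Proof Applying the exact same steps as in the proof of Theorem 18, with $m' = m - \lambda$,
$f = \mathcal{R}(S_{\mathbf{i}}, \sigma)$, and $\mathcal{D}(q,p) = 2(q-p)^2$, we obtain

$$\Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q : \\ 2\big( R_S(G_{Q,S}) - R_D(G_{Q,S}) \big)^2 \;\le\; \dfrac{1}{m-\lambda}\left[ \mathrm{KL}(Q\|P) + \ln\left( \dfrac{1}{\delta}\, \mathbb{E}_{S\sim D^m}\, \mathbb{E}_{(\mathbf{i},\sigma)\sim P}\, e^{(m-\lambda)\cdot 2\cdot \left( \mathbb{E}^{\mathcal{L}_\ell}_{S}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \right)^2} \right) \right] \end{array} \right) \;\ge\; 1-\delta .$$

The result then follows from Lemma 38 and easy calculations. ■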
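
As a rough illustration of how the $4\lambda$ term and the reduced sample size $m - \lambda$ enter the bound, the sketch below evaluates the right-hand side of Theorem 39 for discrete prior and posterior weight vectors. It is not the authors' implementation: the names `kl_divergence` and `sc_gibbs_bound` are hypothetical, the prior is assumed to have full support, and $\xi(m-\lambda)$ is replaced by its upper bound $2\sqrt{m-\lambda}$ stated in Lemma 38, which gives a slightly looser but still valid value.

```python
import math


def kl_divergence(q, p):
    """KL(Q||P) for two discrete distributions given as equal-length sequences."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def sc_gibbs_bound(emp_gibbs_risk, q, p, m, lam, delta):
    """Right-hand side of the sample-compressed bound on the Gibbs risk.

    Assumption: xi(m - lam) is upper-bounded by 2 * sqrt(m - lam).
    """
    if not 0 <= lam < m:
        raise ValueError("the compression size must satisfy 0 <= lam < m")
    complexity = (kl_divergence(q, p) + 4 * lam
                  + math.log(2 * math.sqrt(m - lam) / delta))
    return emp_gibbs_risk + math.sqrt(complexity / (2 * (m - lam)))


if __name__ == "__main__":
    uniform = [0.25] * 4
    for lam in (1, 5, 20):
        print(lam, sc_gibbs_bound(0.2, uniform, uniform, m=1000, lam=lam, delta=0.05))
```

Running the loop shows the price of larger compression sequences: the bound grows with $\lambda$ both through the additive $4\lambda$ and through the smaller denominator $2(m-\lambda)$.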


All the PAC-Bayesian results presented in the preceding sections can be generalized
similarly. We leave these generalizations to the reader, with the exception of the
PAC-Bayesian bounds that have no KL term. These are used in the next section, where we
present the learning algorithm MinCq that minimizes the C-bound.


7.4 Sample-Compressed PAC-Bayesian Bounds without KL 

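The bounds presented in this section generalize the results presented in Section 6 to the
sample compression case. We first need to generalize the notion of self-complement
(Definition 29) to sc-voters.

Definition 40 A reconstruction function $\mathcal{R}$ is said to be self-complemented if for any
training sequence $S \in (\mathcal{X} \times \mathcal{Y})^m$ and any $(\mathbf{i}, \sigma) \in \mathcal{I}_\lambda \times \Sigma_\lambda$, we have

$$-\mathcal{R}(S_{\mathbf{i}}, \sigma) = \mathcal{R}(S_{\mathbf{i}}, -\sigma) ,$$

where, if $\sigma = (\sigma_1, \ldots, \sigma_\lambda)$, then $-\sigma = (-\sigma_1, \ldots, -\sigma_\lambda)$.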


7.4.1 A PAC-Bayesian Theorem for the Gibbs Risk of Sc-Voters 


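Theorem 41 Let $\mathcal{R}$ be a self-complemented reconstruction function that outputs sc-voters
of size at most $\lambda$ (where $\lambda < m$). For any distribution $D$ on $\mathcal{X} \times \{-1,1\}$, for any prior
distribution $P$ on $\mathcal{I}_\lambda \times \Sigma_\lambda$, and any $\delta \in (0,1]$, we have

$$\Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q \text{ aligned on } P : \\ R_D(G_{Q,S}) \;\le\; R_S(G_{Q,S}) + \sqrt{\dfrac{1}{2(m-\lambda)}\left[ 4\lambda + \ln\dfrac{\xi(m-\lambda)}{\delta} \right]} \end{array} \right) \;\ge\; 1-\delta .$$

Proof First note that $2\cdot(q-p)^2 = 2\cdot((1-q)-(1-p))^2$. Then apply the exact same steps
as in the proof of Theorem 31 with $m' = m - \lambda$, $f = \mathcal{R}(S_{\mathbf{i}}, \sigma)$, and $\mathcal{D}(q,p) = 2(q-p)^2$ to
obtain

$$\Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q \text{ aligned on } P : \\ 2\big( R_S(G_{Q,S}) - R_D(G_{Q,S}) \big)^2 \;\le\; \dfrac{1}{m-\lambda} \ln\left( \dfrac{1}{\delta}\, \mathbb{E}_{S\sim D^m}\, \mathbb{E}_{(\mathbf{i},\sigma)\sim P}\, e^{(m-\lambda)\cdot 2\cdot \left( \mathbb{E}^{\mathcal{L}_\ell}_{S}(\mathcal{R}(S_{\mathbf{i}},\sigma)) - \mathbb{E}^{\mathcal{L}_\ell}_{D}(\mathcal{R}(S_{\mathbf{i}},\sigma)) \right)^2} \right) \end{array} \right) \;\ge\; 1-\delta .$$

The result then follows from Lemma 38 and easy calculations. ■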


7.4.2 A PAC-Bayesian Theorem for the Disagreement of Sc-Voters 

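Given a training sequence $S$ and a reconstruction function $\mathcal{R}$, we define the expected
disagreement of a distribution $Q$ on $\mathcal{I}_\lambda \times \Sigma_\lambda$ relative to $D'$ as

$$\begin{aligned}
d^{D'}_{Q,S} &\overset{\text{def}}{=} \mathbb{E}_{x\sim D'_{\mathcal{X}}}\, \mathbb{E}_{(\mathbf{i},\sigma)\sim Q}\, \mathbb{E}_{(\mathbf{i}',\sigma')\sim Q}\, \mathcal{L}_\ell\big( \mathcal{R}(S_{\mathbf{i}},\sigma)(x),\, \mathcal{R}(S_{\mathbf{i}'},\sigma')(x) \big) \\
&= \mathbb{E}_{(\mathbf{i},\mathbf{i}',\sigma,\sigma')\sim Q^2}\, \mathbb{E}^{\mathcal{L}_d}_{D'}\big( \overline{\mathcal{R}}(S_{\mathbf{i},\mathbf{i}'},\sigma,\sigma') \big) ,
\end{aligned}$$

where

$$Q^2(\mathbf{i},\mathbf{i}',\sigma,\sigma') \overset{\text{def}}{=} Q(\mathbf{i},\sigma) \cdot Q(\mathbf{i}',\sigma') ,$$

$$\overline{\mathcal{R}}(S_{\mathbf{i},\mathbf{i}'},\sigma,\sigma')(x) \overset{\text{def}}{=} \big\langle \mathcal{R}(S_{\mathbf{i}},\sigma)(x),\, \mathcal{R}(S_{\mathbf{i}'},\sigma')(x) \big\rangle .$$

Thus, $\overline{\mathcal{R}}$ is a new reconstruction function that outputs an sc-paired-voter which is the
sample-compressed version of the paired-voter of Definition 24. From there, we adapt
Corollary 37 to sc-voters, and we obtain the following PAC-Bayesian theorem. This result
bounds $d^D_{Q,S}$ for posterior distributions $Q$ aligned on a prior distribution $P$.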

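Theorem 42 Let $\mathcal{R}$ be a self-complemented reconstruction function that outputs sc-voters
of size at most $\lambda$ (where $\lambda < \frac{m}{2}$). For any distribution $D$ on $\mathcal{X} \times \{-1,1\}$, for any prior
distribution $P$ on $\mathcal{I}_\lambda \times \Sigma_\lambda$, and any $\delta \in (0,1]$, we have

$$\Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q \text{ aligned on } P : \\ d^D_{Q,S} \;\ge\; d^S_{Q,S} - \sqrt{\dfrac{1}{2(m-2\lambda)}\left[ 8\lambda + \ln\dfrac{\xi(m-2\lambda)}{\delta} \right]} \end{array} \right) \;\ge\; 1-\delta .$$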


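Proof Let $P^2(\mathbf{i},\mathbf{i}',\sigma,\sigma') \overset{\text{def}}{=} P(\mathbf{i},\sigma) \cdot P(\mathbf{i}',\sigma')$. Now note that
$2\cdot(q-p)^2 = 2\cdot((1-q)-(1-p))^2$. Then apply the exact same steps as in the proof of
Theorem 35 with $m' = m - 2\lambda$, $f_{ij} = \overline{\mathcal{R}}(S_{\mathbf{i},\mathbf{i}'},\sigma,\sigma')$ and $\mathcal{D}(q,p) = 2(q-p)^2$ to obtain

$$\Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q \text{ aligned on } P : \\ 2\big( d^S_{Q,S} - d^D_{Q,S} \big)^2 \;\le\; \dfrac{1}{m-2\lambda} \ln\left( \dfrac{1}{\delta}\, \mathbb{E}_{S\sim D^m}\, \mathbb{E}_{(\mathbf{i},\mathbf{i}',\sigma,\sigma')\sim P^2}\, e^{(m-2\lambda)\cdot 2\cdot \left( \mathbb{E}^{\mathcal{L}_d}_{S}(\overline{\mathcal{R}}(S_{\mathbf{i},\mathbf{i}'},\sigma,\sigma')) - \mathbb{E}^{\mathcal{L}_d}_{D}(\overline{\mathcal{R}}(S_{\mathbf{i},\mathbf{i}'},\sigma,\sigma')) \right)^2} \right) \end{array} \right) \;\ge\; 1-\delta .$$

Calculations similar to the ones of the proof of Lemma 38 (with $\lambda$ replaced by $2\lambda$) give

$$\mathbb{E}_{S\sim D^m}\, \mathbb{E}_{(\mathbf{i},\mathbf{i}',\sigma,\sigma')\sim P^2}\, e^{(m-2\lambda)\cdot 2\cdot \left( \mathbb{E}^{\mathcal{L}_d}_{S}(\overline{\mathcal{R}}(S_{\mathbf{i},\mathbf{i}'},\sigma,\sigma')) - \mathbb{E}^{\mathcal{L}_d}_{D}(\overline{\mathcal{R}}(S_{\mathbf{i},\mathbf{i}'},\sigma,\sigma')) \right)^2} \;\le\; e^{8\lambda} \cdot \xi(m-2\lambda) .$$

Therefore, we have

$$\Pr_{S\sim D^m}\left( \begin{array}{l} \text{For all posteriors } Q \text{ aligned on } P : \\ 2\big( d^S_{Q,S} - d^D_{Q,S} \big)^2 \;\le\; \dfrac{1}{m-2\lambda} \left[ 8\lambda + \ln\dfrac{\xi(m-2\lambda)}{\delta} \right] \end{array} \right) \;\ge\; 1-\delta ,$$

and the result is obtained by isolating $d^D_{Q,S}$ in the inequality. ■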


7.4.3 A Sample Compression Bound for the Risk of the Majority Vote 


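Let us now exploit Theorems 41 and 42, together with the C-bound of Theorem 11, to
obtain a bound on the risk of a majority vote with kernel functions as voters. Given any
similarity function (possibly a kernel) $k : \mathcal{X} \times \mathcal{X} \to [-1,1]$ and a training sequence of
size $m$, we consider a majority vote of sc-voters of compression size at most 1 given by the
following reconstruction function,

$$\mathcal{R}_k(S_{\mathbf{i}}, (\sigma))(x) \overset{\text{def}}{=} \begin{cases} \sigma & \text{if } \mathbf{i} = (), \\ \sigma \cdot k(x_i, x) & \text{otherwise } (\mathbf{i} = (i)), \end{cases}$$

where $\mathbf{i} \in \mathcal{I}_1 = \{(), (1), (2), \ldots, (m)\}$ and $(\sigma) \in \Sigma_1$ (thus, $\sigma \in \{-1,1\}$). Here, the elements
of sets $\mathcal{I}_1$ and $\Sigma_1$ are obtained from Equation (38), with $\lambda = 1$. Note that $\mathcal{R}_k$ is
self-complemented (Definition 40) because $-\mathcal{R}_k(S_{\mathbf{i}}, (\sigma)) = \mathcal{R}_k(S_{\mathbf{i}}, (-\sigma))$ for any $(\mathbf{i}, \sigma)$.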
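
A reconstruction function of this kind can be sketched in a few lines. The code below assumes that the function returns the constant $\sigma$ for the empty compression sequence and $\sigma \cdot k(x_i, \cdot)$ otherwise, which is what the set of $2m+2$ voters listed next suggests. The names `rbf_kernel`, `reconstruct` and `all_sc_voters` are hypothetical, the RBF kernel is only one possible choice of similarity function, and indices are 0-based here whereas the text uses 1-based indices.

```python
import math


def rbf_kernel(x, x2, gamma=1.0):
    """A similarity function with values in (0, 1], hence within [-1, 1]."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, x2)))


def reconstruct(compression, sigma, k=rbf_kernel):
    """Voter built from a compression sequence of size <= 1 and a one-bit message."""
    if len(compression) == 0:
        return lambda x: float(sigma)          # the "dummy voter" b (or -b)
    (x_i, _y_i), = compression                 # exactly one example (x_i, y_i)
    return lambda x: sigma * k(x_i, x)


def all_sc_voters(sample, k=rbf_kernel):
    """The 2m + 2 voters obtained from every index vector and every message."""
    index_vectors = [()] + [(i,) for i in range(len(sample))]
    return [reconstruct([sample[j] for j in idx], sigma, k)
            for sigma in (1, -1) for idx in index_vectors]


if __name__ == "__main__":
    S = [((0.0, 1.0), 1), ((1.0, 1.0), -1)]
    print([f((0.5, 0.5)) for f in all_sc_voters(S)])
```

Flipping the message bit flips the sign of the voter, which is exactly the self-complement property of Definition 40.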

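Once the training sequence $S \sim D^m$ is observed, the (self-complemented) reconstruction
function $\mathcal{R}_k$ gives rise to the following set of $2m+2$ sc-voters,

$$\mathcal{H}^{\mathcal{R}_k}_{S,1} \overset{\text{def}}{=} \big\{ b(\cdot),\, k(x_1,\cdot),\, k(x_2,\cdot),\, \ldots,\, k(x_m,\cdot),\, -b(\cdot),\, -k(x_1,\cdot),\, -k(x_2,\cdot),\, \ldots,\, -k(x_m,\cdot) \big\} ,$$

where $b : \mathcal{X} \to \{1\}$ is a “dummy voter” that always outputs 1 and allows introducing a
bias value into the majority vote classifier. Note that $\mathcal{H}^{\mathcal{R}_k}_{S,1}$ is a self-complemented set of
sc-voters, and the margin of the majority vote given by the distribution $Q$ on $\mathcal{H}^{\mathcal{R}_k}_{S,1}$ is

$$M_{Q,S}(x,y) \overset{\text{def}}{=} y \left( Q(b(\cdot)) - Q(-b(\cdot)) + \sum_{i=1}^{m} \big[ Q(k(x_i,\cdot)) - Q(-k(x_i,\cdot)) \big]\, k(x_i, x) \right) .$$

Consequently, the empirical first and second moments of this margin are

$$\mu_1(M^S_{Q,S}) = \frac{1}{m} \sum_{i=1}^{m} M_{Q,S}(x_i, y_i) , \quad \text{and} \quad \mu_2(M^S_{Q,S}) = \frac{1}{m} \sum_{i=1}^{m} \big( M_{Q,S}(x_i, y_i) \big)^2 .$$

Hence, the empirical Gibbs risk and the empirical expected disagreement can be expressed
by

$$R_S(G_{Q,S}) = \frac{1}{2} \big( 1 - \mu_1(M^S_{Q,S}) \big) , \quad \text{and} \quad d^S_{Q,S} = \frac{1}{2} \big( 1 - \mu_2(M^S_{Q,S}) \big) . \tag{42}$$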
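
The two identities relating the Gibbs risk and the expected disagreement to the margin moments can be checked numerically. The sketch below computes the Gibbs risk and the disagreement directly from the linear loss, assumed here to be $(1 - y f(x))/2$ for the risk and $(1 - f_1(x) f_2(x))/2$ for a pair of voters, and compares them with the moment-based expressions. The helper names and the toy voters are illustrative only.

```python
from itertools import product


def margin_moments(voters, weights, sample):
    """Empirical first and second moments (mu1, mu2) of the margin y * E_Q f(x)."""
    margins = [y * sum(w * f(x) for f, w in zip(voters, weights)) for x, y in sample]
    m = len(sample)
    return sum(margins) / m, sum(t * t for t in margins) / m


def gibbs_risk(voters, weights, sample):
    """Q-average of the empirical linear loss (1 - y f(x)) / 2."""
    m = len(sample)
    return sum(w * sum((1 - y * f(x)) / 2 for x, y in sample) / m
               for f, w in zip(voters, weights))


def disagreement(voters, weights, sample):
    """Q x Q average of the empirical loss (1 - f1(x) f2(x)) / 2 of voter pairs."""
    m = len(sample)
    pairs = product(list(zip(voters, weights)), repeat=2)
    return sum(w1 * w2 * sum((1 - f1(x) * f2(x)) / 2 for x, _ in sample) / m
               for (f1, w1), (f2, w2) in pairs)


if __name__ == "__main__":
    # A self-complemented toy set: a constant voter, the identity, and their negations.
    voters = [lambda x: 1.0, lambda x: x, lambda x: -1.0, lambda x: -x]
    weights = [0.3, 0.4, 0.2, 0.1]
    sample = [(0.5, 1), (-0.2, -1), (0.9, 1), (-0.7, 1)]
    mu1, mu2 = margin_moments(voters, weights, sample)
    print(gibbs_risk(voters, weights, sample), (1 - mu1) / 2)
    print(disagreement(voters, weights, sample), (1 - mu2) / 2)
```

Each printed pair should agree up to floating-point error, whatever weights are used, as long as they sum to one.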

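Thus, we obtain the following bound on the risk of a majority vote of kernel voters
$R_D(B_{Q,S})$ for aligned posteriors $Q$.

PAC-Bound 3’ Let $k : \mathcal{X} \times \mathcal{X} \to [-1,1]$. For any distribution $D$ on $\mathcal{X} \times \{-1,1\}$, for any
prior distribution $P$ on $\mathcal{I}_1 \times \Sigma_1$, and any $\delta \in (0,1]$, we have

$$\Pr_{S\sim D^m}\left( \forall\, Q \text{ aligned on } P : \quad R_D(B_{Q,S}) \;\le\; 1 - \frac{\big(1 - 2\cdot\overline{r}\big)^2}{1 - 2\cdot\underline{d}} \;=\; 1 - \frac{\big(\underline{\mu_1}\big)^2}{\overline{\mu_2}} \right) \;\ge\; 1-\delta ,$$

where

$$\begin{aligned}
\overline{r} &\overset{\text{def}}{=} \min\left( \tfrac{1}{2},\; R_S(G_{Q,S}) + \sqrt{\tfrac{1}{2(m-1)}\left[ 4 + \ln\tfrac{\xi(m-1)}{\delta/2} \right]} \right), \\
\underline{d} &\overset{\text{def}}{=} \max\left( 0,\; d^S_{Q,S} - \sqrt{\tfrac{1}{2(m-2)}\left[ 8 + \ln\tfrac{\xi(m-2)}{\delta/2} \right]} \right), \\
\underline{\mu_1} &\overset{\text{def}}{=} \max\left( 0,\; \mu_1(M^S_{Q,S}) - \sqrt{\tfrac{2}{m-1}\left[ 4 + \ln\tfrac{\xi(m-1)}{\delta/2} \right]} \right), \\
\overline{\mu_2} &\overset{\text{def}}{=} \min\left( 1,\; \mu_2(M^S_{Q,S}) + \sqrt{\tfrac{2}{m-2}\left[ 8 + \ln\tfrac{\xi(m-2)}{\delta/2} \right]} \right).
\end{aligned}$$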

Proof The proof is almost identical to the one of PAC-Bound 3, except that it relies on
sample-compressed PAC-Bayesian bounds. Indeed, the inequality is a consequence of
Theorem 11, as well as Theorems 41 and 42. The equality
$1 - \frac{(1 - 2\cdot\overline{r})^2}{1 - 2\cdot\underline{d}} = 1 - \frac{(\underline{\mu_1})^2}{\overline{\mu_2}}$ is a direct
application of Equation (42). ■


PAC-Bounds 3 and 3’ are expressed in two forms. The first form relies on bounds on
the Gibbs risk and the expected disagreement (denoted $\overline{r}$ and $\underline{d}$). The second form relies
on bounds on the first and second moments of the margin (denoted $\underline{\mu_1}$ and $\overline{\mu_2}$). This latter
form is used to justify the learning algorithm presented in Section 8.

8. MinCq: Learning by Minimizing the C-bound 

In this section, we propose a new algorithm, that we call MinCq, for constructing a weighted
majority vote of voters. One version of this algorithm is designed for the supervised
inductive framework and minimizes the C-bound. A second version of MinCq that minimizes the
C-bound in the transductive (or semi-supervised) setting can be found in Laviolette et al.
(2011). Both versions can be expressed as quadratic programs on positive semi-definite
matrices.

As is the case for Boosting algorithms (Schapire and Singer, 1999), MinCq is designed 
to output a Q-weighted majority vote of voters that perform rather poorly individually and, 
consequently, are often called weak learners. Hence, the decision of each vote is based on a 
small majority (i.e., with a Gibbs risk just a bit lower than 1/2). Recall that in situations 
where the Gibbs risk is high (i.e., the first moment of the margin is close to 0), the C-bound 
can nevertheless remain small if the voters of the majority vote are maximally uncorrelated. 

Unfortunately, minimizing the empirical value of the C-bound tends to overfit the data.
To overcome this problem, MinCq uses a distribution $Q$ of voters which is constrained to
be quasi-uniform (see Equation 37) and for which the first moment of the margin is forced
to be not too close to 0. More precisely, the value $\mu_1(M_Q^S)$ is constrained to be bigger than
some strictly positive constant $\mu$. This $\mu$ then becomes a hyperparameter of the algorithm
that has to be fixed by cross-validation, as the parameter $C$ is for SVM. This new learning
strategy is justified by PAC-Bound 3, dedicated to quasi-uniform posteriors$^{16}$, and
PAC-Bound 3’, that is specialized to kernel voters. Hence, MinCq can be viewed as the algorithm
that simply looks for the majority vote of margin at least $\mu$ that minimizes PAC-Bound 3
(or PAC-Bound 3’ in the sample compression case).

MinCq is also justified by two important properties of quasi-uniform majority votes.
First, as we shall see in Theorem 43, there is no loss of generality when restricting ourselves
to quasi-uniform distributions. Second, as we shall see in Theorem 44, for any margin
threshold $\mu > 0$ and any quasi-uniform distribution $Q$ such that $\mu_1(M_Q^S) \ge \mu$, there is
another quasi-uniform distribution $Q'$ whose margin is exactly $\mu$ that achieves the same
majority vote and therefore has the same C-bound value.

Thus, to minimize the C-bound, the learner must substantially reduce the variance of
the margin distribution - i.e., $\mu_2(M_Q^S)$ - while maintaining its first moment - i.e., $\mu_1(M_Q^S)$
- over the threshold $\mu$. Many learning algorithms actually exploit this strategy in different
ways. Indeed, the variance of the margin distribution is controlled by Breiman (2001) for
producing random forests, by Dredze et al. (2010) in the transfer learning setting, and
by Shen and Li (2010) in the Boosting setting. Thus, the idea of minimizing the variance of
the margin is well-known and used. We propose a new theoretical justification for all these
types of algorithms and propose a novel learning algorithm, called MinCq, that directly
minimizes the C-bound.

8.1 From the C-bound to the MinCq Learning Algorithm 

We only consider learning algorithms that construct majority votes based on a (finite)
self-complemented hypothesis space $\mathcal{H} = \{f_1, \ldots, f_{2n}\}$ of real-valued voters. Recall that these
voters can be classifiers such as decision stumps or can be given by a kernel $k$ evaluated on
the examples of $S$ such as $f_i(\cdot) = k(x_i, \cdot)$.

We consider the second form of the C-bound, which relies on the first two moments of
the margin of the majority vote classifier (see Theorem 11):

$$\mathcal{C}_Q^{D'} = 1 - \frac{\big( \mu_1(M_Q^{D'}) \big)^2}{\mu_2(M_Q^{D'})} .$$

Our first attempts to minimize the C-bound confronted us with two problems.

Problem 1: an empirical C-bound minimization without any regularization tends to overfit
the data.

Problem 2: most of the time, the distributions $Q$ minimizing the C-bound $\mathcal{C}_Q^S$ are such
that both $\mu_1(M_Q^S)$ and $\mu_2(M_Q^S)$ are very close to 0. Since $\mathcal{C}_Q^S = 1 - (\mu_1(M_Q^S))^2 / \mu_2(M_Q^S)$,
this gives a 0/0 numerical instability. Since $(\mu_1(M_Q^D))^2 / \mu_2(M_Q^D)$ can only be empirically
estimated by $(\mu_1(M_Q^S))^2 / \mu_2(M_Q^S)$, Problem 2 amplifies Problem 1.

16. PAC-Bound 3 is dedicated to posteriors $Q$ that are aligned on a prior distribution $P$, but in this section
we always consider that the prior distribution $P$ is uniform, thus leading to a quasi-uniform posterior $Q$.

A natural way to resolve Problem 1 is to restrict ourselves to quasi-uniform distributions,
i.e., distributions that are aligned on the uniform prior (see Section 6.1 for the definition).
In Section 6, we show that with such distributions, one can upper-bound the Bayes risk 
without needing a KL-regularization term. Hence, according to this PAC-Bayesian theory, 
these distributions have some “built-in” regularization effect that should prevent overfitting. 
Section 7 generalizes these results to the sample compression setting, which is necessary in 
the case where voters such as kernels are defined using the training set. 

The next theorem shows that this restriction on Q does not reduce the set of possible 
majority votes. 


Theorem 43 Let $\mathcal{H}$ be a self-complemented set. For all distributions $Q$ on $\mathcal{H}$, there exists
a quasi-uniform distribution $Q'$ on $\mathcal{H}$ that gives the same majority vote as $Q$, and that has
the same empirical and true C-bound values, i.e.,

$$B_{Q'} = B_Q , \quad \mathcal{C}^S_{Q'} = \mathcal{C}^S_Q \quad \text{and} \quad \mathcal{C}^D_{Q'} = \mathcal{C}^D_Q .$$


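Proof Let $Q$ be a distribution on $\mathcal{H} = \{f_1, \ldots, f_{2n}\}$, let
$M \overset{\text{def}}{=} \max_{i \in \{1,\ldots,n\}} |Q(f_{i+n}) - Q(f_i)|$, and let $Q'$ be defined as

$$Q'(f_i) \overset{\text{def}}{=} \frac{1}{2n} + \frac{Q(f_i) - Q(f_{i+n})}{2nM} ,$$

where the indices of $f$ are defined modulo $2n$ (i.e., $f_{(i+n)+n} = f_i$). Then it is easy to show
that $Q'$ is a quasi-uniform distribution. Moreover, for any example $x \in \mathcal{X}$, we have

$$\begin{aligned}
\mathbb{E}_{f\sim Q'} f(x) &\overset{\text{def}}{=} \sum_{i=1}^{2n} Q'(f_i)\, f_i(x) \;=\; \sum_{i=1}^{n} \big( Q'(f_i) - Q'(f_{i+n}) \big) f_i(x) \\
&= \sum_{i=1}^{n} \frac{2Q(f_i) - 2Q(f_{i+n})}{2nM}\, f_i(x) \;=\; \frac{1}{nM} \sum_{i=1}^{n} \big( Q(f_i) - Q(f_{i+n}) \big) f_i(x) \\
&= \frac{1}{nM} \sum_{i=1}^{2n} Q(f_i)\, f_i(x) \;=\; \frac{1}{nM}\, \mathbb{E}_{f\sim Q} f(x) .
\end{aligned}$$

Since $nM > 0$, this implies that $B_{Q'}(x) = B_Q(x)$ for all $x \in \mathcal{X}$. It also shows that
$M_{Q'}(x,y) = \frac{1}{nM} M_Q(x,y)$, which implies that
$\big(\mu_1(M^{D'}_{Q'})\big)^2 = \big(\frac{1}{nM}\big)^2 \big(\mu_1(M^{D'}_{Q})\big)^2$ and
$\mu_2(M^{D'}_{Q'}) = \big(\frac{1}{nM}\big)^2 \mu_2(M^{D'}_{Q})$ for both $D' = D$ and $D' = S$.

The theorem then follows from the definition of the C-bound. ■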
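
The construction used in this proof is easy to try on a small example. The sketch below assumes the voters are stored so that the voter at position $i+n$ is the complement of the voter at position $i$ (0-based here), and that $Q'(f_i) = \frac{1}{2n} + \frac{Q(f_i) - Q(f_{i+n})}{2nM}$ with $M$ the largest absolute weight difference within a pair. The function name `to_quasi_uniform` is hypothetical.

```python
def to_quasi_uniform(q):
    """Map weights q on a self-complemented set (f_{i+n} = -f_i) to quasi-uniform ones."""
    two_n = len(q)
    n = two_n // 2
    big_m = max(abs(q[i + n] - q[i]) for i in range(n))
    if big_m == 0:
        # Every voter weighs as much as its complement: the vote is identically zero.
        raise ValueError("the construction requires M > 0")
    return [1 / two_n + (q[i] - q[(i + n) % two_n]) / (two_n * big_m)
            for i in range(two_n)]


if __name__ == "__main__":
    q = [0.5, 0.1, 0.1, 0.3]            # any distribution on 2n = 4 voters
    q_prime = to_quasi_uniform(q)
    outputs = [0.8, -0.3]               # f_1(x), f_2(x) on some example x
    full = outputs + [-o for o in outputs]
    print(q_prime)
    print(sum(w * o for w, o in zip(q, full)),
          sum(w * o for w, o in zip(q_prime, full)))
```

The two printed votes differ only by the positive factor $\frac{1}{nM}$, so they have the same sign, and the ratio of squared first moment to second moment that defines the C-bound is unchanged.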


Theorem 43 points out a nice property of the C-bound: different distributions $Q$ that
give rise to the same majority vote have the same (real and empirical) C-bound values. Since
the C-bound is a bound on majority votes, this is a suitable property. Moreover,
PAC-Bounds 3 and 3’, together with Theorem 43, indicate that restricting ourselves to
quasi-uniform distributions is a natural solution to the problem of overfitting (see Problem 1).
Unfortunately, Problem 2 remains since a consequence of the next theorem is that, among
all the posteriors $Q$ that minimize the C-bound, there is always one whose empirical margin
$\mu_1(M_Q^S)$ is as close to 0 as we want.

Theorem 44 Let $\mathcal{H}$ be a self-complemented set. For all $\mu \in (0,1]$ and for all quasi-uniform
distributions $Q$ on $\mathcal{H}$ having an empirical margin $\mu_1(M_Q^S) \ge \mu$, there exists a quasi-uniform
distribution $Q'$ on $\mathcal{H}$, having an empirical margin equal to $\mu$, such that $Q$ and $Q'$ induce
the same majority vote and have the same empirical and true C-bound values, i.e.,

$$\mu_1(M^S_{Q'}) = \mu , \quad B_{Q'} = B_Q , \quad \mathcal{C}^S_{Q'} = \mathcal{C}^S_Q \quad \text{and} \quad \mathcal{C}^D_{Q'} = \mathcal{C}^D_Q .$$

Proof Let $Q$ be a quasi-uniform distribution on $\mathcal{H} = \{f_1, \ldots, f_{2n}\}$ such that $\mu_1(M_Q^S) \ge \mu$.
We define $Q'$ as

$$Q'(f_i) \overset{\text{def}}{=} \frac{\mu}{\mu_1(M_Q^S)} \cdot Q(f_i) + \left( 1 - \frac{\mu}{\mu_1(M_Q^S)} \right) \cdot \frac{1}{2n} , \qquad i \in \{1, \ldots, 2n\} .$$

Clearly $Q'$ is a quasi-uniform distribution since it is a convex combination of a quasi-uniform
distribution and the uniform one. Then, similarly as in the proof of Theorem 43, one can
easily show that $\mathbb{E}_{f\sim Q'} f(x) = \frac{\mu}{\mu_1(M_Q^S)}\, \mathbb{E}_{f\sim Q} f(x)$, which implies the result. ■
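
The rescaling of this proof can be illustrated as follows. The sketch assumes $Q'$ is the convex combination $\frac{\mu}{\mu_1} Q + (1 - \frac{\mu}{\mu_1}) U$, with $U$ the uniform distribution on the $2n$ voters, and that voter $i+n$ is the complement of voter $i$. The helper names `first_moment` and `rescale_margin` are hypothetical.

```python
def first_moment(q, yf):
    """Empirical margin mu_1, where yf[t][i] = y_t * f_i(x_t) for the first n voters."""
    n = len(q) // 2
    return sum(sum((q[i] - q[i + n]) * row[i] for i in range(n)) for row in yf) / len(yf)


def rescale_margin(q, mu1, mu):
    """Mix Q with the uniform distribution so that the empirical margin becomes mu."""
    if not 0 < mu <= mu1:
        raise ValueError("requires 0 < mu <= mu1")
    t = mu / mu1
    return [t * qi + (1 - t) / len(q) for qi in q]


if __name__ == "__main__":
    q = [0.4, 0.3, 0.1, 0.2]                        # quasi-uniform: pairs sum to 1/n
    yf = [[1.0, 0.5], [0.8, -0.2], [1.0, 1.0]]      # y * f_i(x) on three examples
    mu1 = first_moment(q, yf)
    q_prime = rescale_margin(q, mu1, mu=0.1)
    print(mu1, first_moment(q_prime, yf))
```

Because the uniform distribution on a self-complemented set has a zero vote everywhere, mixing with it only shrinks the margin, leaving the sign of the vote untouched.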

Training set bounds (such as VC-bounds for example) are known to degrade when the
capacity of classification increases. As shown by Theorem 44 for the majority vote setting,
this capacity increases as $\mu$ decreases to 0. Thus, we expect that any training set bound
degrades for small $\mu$. This is not the case for the C-bound itself, but the C-bound is not a
training set bound. To obtain a training set bound, we have to relate the empirical value $\mathcal{C}_Q^S$
to the true one $\mathcal{C}_Q^D$, which is done via PAC-Bounds 3 and 3’. In these bounds, there is indeed
a degradation as $\mu$ decreases because the true C-bound is of the form
$1 - (\mu_1(M_Q^D))^2 / \mu_2(M_Q^D)$. Since $\mu = \mu_1(M_Q^S)$, and because a small $\mu_1(M_Q^S)$ tends to
produce a small $\mu_2(M_Q^S)$, the bounds on $\mathcal{C}_Q^D$ given $\mathcal{C}_Q^S$ that result from PAC-Bounds 3 and 3’
are therefore much looser for small $\mu$ because of the 0/0 instability. As explained in the
introduction of the present section, one way to overcome the instability identified in Problem 2
is to restrict ourselves to quasi-uniform distributions whose empirical margin is greater than
or equal to some threshold $\mu$. Interestingly, thanks to Theorem 44, this is equivalent to
restricting ourselves to distributions having empirical margin exactly equal to $\mu$. From
Theorems 11 and 44, it then follows that minimizing the C-bound, under the constraint
$\mu_1(M_Q^S) \ge \mu$, is equivalent to minimizing $\mu_2(M_Q^S)$, under the constraint $\mu_1(M_Q^S) = \mu$.
From this observation, and the fact that minimizing PAC-Bounds 3 and 3’ is equivalent to
minimizing the empirical C-bound $\mathcal{C}_Q^S$, we can now define the algorithm MinCq.

In this section, $\mu$ always represents a restriction on the margin. Moreover, we say
that a value $\mu$ is $D'$-realizable if there exists some quasi-uniform distribution $Q$ such that
$\mu_1(M_Q^{D'}) = \mu$. The proposed algorithm, called MinCq, is then defined as follows.

Definition 45 (MinCq Algorithm) Given a self-complemented set $\mathcal{H}$ of voters, a training
set $S$, and an $S$-realizable $\mu > 0$, among all quasi-uniform distributions $Q$ of empirical
margin $\mu_1(M_Q^S)$ exactly equal to $\mu$, the algorithm MinCq consists in finding one that
minimizes $\mu_2(M_Q^S)$.

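This algorithm can be expressed as a simple quadratic program (QP) that has only
$n$ variables (instead of $2n$), and thus can be easily solved by any QP solver. In the next
subsection, we explain how the algorithm of Definition 45 can be turned into a QP.

8.2 MinCq as a Quadratic Program

Given a training set $S$, and a self-complemented set $\mathcal{H}$ of voters $\{f_1, f_2, \ldots, f_{2n}\}$, let

$$\mathcal{M}_i \overset{\text{def}}{=} \mathbb{E}_{(x,y)\sim S}\; y\, f_i(x) \quad \text{and} \quad \mathcal{M}_{i,j} \overset{\text{def}}{=} \mathbb{E}_{(x,y)\sim S}\; f_i(x)\, f_j(x) .$$

Let $\mathbf{M}$ be a symmetric $n \times n$ matrix, $\mathbf{a}$ be a column vector of $n$ elements, and $\mathbf{m}$ be a
column vector of $n$ elements defined by

$$\mathbf{M} \overset{\text{def}}{=} \begin{bmatrix} \mathcal{M}_{1,1} & \mathcal{M}_{1,2} & \cdots & \mathcal{M}_{1,n} \\ \mathcal{M}_{2,1} & \mathcal{M}_{2,2} & \cdots & \mathcal{M}_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ \mathcal{M}_{n,1} & \mathcal{M}_{n,2} & \cdots & \mathcal{M}_{n,n} \end{bmatrix} , \quad
\mathbf{a} \overset{\text{def}}{=} \begin{bmatrix} \frac{1}{n}\sum_{i=1}^{n} \mathcal{M}_{i,1} \\ \frac{1}{n}\sum_{i=1}^{n} \mathcal{M}_{i,2} \\ \vdots \\ \frac{1}{n}\sum_{i=1}^{n} \mathcal{M}_{i,n} \end{bmatrix} , \quad \text{and} \quad
\mathbf{m} \overset{\text{def}}{=} \begin{bmatrix} \mathcal{M}_{1} \\ \mathcal{M}_{2} \\ \vdots \\ \mathcal{M}_{n} \end{bmatrix} . \tag{43}$$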

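Finally, let $\mathbf{q}$ be the column vector of $n$ QP-variables, where each element $q_i$ represents
the weight $Q(f_i)$.

Using the above definitions and the fact that $\mathcal{H}$ is self-complemented, one can show that
$\mathcal{M}_{i+n} = -\mathcal{M}_i$, $\mathcal{M}_{i+n,j} = \mathcal{M}_{i,j+n} = -\mathcal{M}_{i,j}$, and $q_{i+n} = \frac{1}{n} - q_i$.

Moreover, it follows from the definitions of the first two moments of the margin $\mu_1(M_Q^S)$
and $\mu_2(M_Q^S)$ (see Equations 6 and 8) that

$$\mu_1(M_Q^S) = \sum_{i=1}^{2n} q_i\, \mathcal{M}_i , \quad \text{and} \quad \mu_2(M_Q^S) = \sum_{i=1}^{2n} \sum_{j=1}^{2n} q_i\, q_j\, \mathcal{M}_{i,j} .$$

As MinCq consists in finding the quasi-uniform distribution $Q$ that minimizes $\mu_2(M_Q^S)$,
with a margin $\mu_1(M_Q^S)$ exactly equal to the hyperparameter $\mu$, let us now rewrite $\mu_2(M_Q^S)$
and $\mu_1(M_Q^S)$ using the vectors and matrices defined in Equation (43). It follows that

$$\begin{aligned}
\mu_2(M_Q^S) &= \sum_{i=1}^{2n} \sum_{j=1}^{2n} q_i\, q_j\, \mathcal{M}_{i,j} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \big[ q_i q_j - q_{i+n} q_j - q_i q_{j+n} + q_{i+n} q_{j+n} \big]\, \mathcal{M}_{i,j} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \Big[ 4 q_i q_j - \frac{4}{n} q_i + \frac{1}{n^2} \Big]\, \mathcal{M}_{i,j} \\
&= 4 \sum_{i=1}^{n} \sum_{j=1}^{n} q_i q_j\, \mathcal{M}_{i,j} - \frac{4}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} q_i\, \mathcal{M}_{i,j} + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathcal{M}_{i,j} \\
&= 4 \big( \mathbf{q}^\top \mathbf{M}\, \mathbf{q} - \mathbf{a}^\top \mathbf{q} \big) + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathcal{M}_{i,j} , && (44)
\end{aligned}$$

and

$$\begin{aligned}
\mu_1(M_Q^S) &= \sum_{i=1}^{2n} q_i\, \mathcal{M}_i \;=\; \sum_{i=1}^{n} (q_i - q_{i+n})\, \mathcal{M}_i \;=\; \sum_{i=1}^{n} \Big( 2 q_i - \frac{1}{n} \Big) \mathcal{M}_i \\
&= 2 \sum_{i=1}^{n} q_i\, \mathcal{M}_i - \frac{1}{n} \sum_{i=1}^{n} \mathcal{M}_i \;=\; 2\, \mathbf{m}^\top \mathbf{q} - \frac{1}{n} \sum_{i=1}^{n} \mathcal{M}_i .
\end{aligned}$$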


As the objective function $\mu_2(M_Q^S)$ and the constraint $\mu_1(M_Q^S) = \mu$ of the QP can
be defined using only $n$ variables, there is no need to consider in the QP the weights
of the last $n$ voters. These weights can always be recovered from the first $n$, because
$q_{i+n} = \frac{1}{n} - q_i$, for any $i$. Note however that to be sure that the solution of the QP has the
quasi-uniformity property, we have to add the following constraints to the program:

$$q_i \in \Big[ 0, \frac{1}{n} \Big] \quad \text{for any } i .$$

Note that the multiplicative constant 4 and the additive constant
$\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathcal{M}_{i,j}$
from Equation (44) can be omitted, as the optimal solution will stay the same. From all that
precedes and given any $S$-realizable $\mu$, MinCq solves the optimization problem described
by Program 1.


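Program 1: MinCq - a quadratic program for classification

Solve $\operatorname{argmin}_{\mathbf{q}} \ \mathbf{q}^\top \mathbf{M}\, \mathbf{q} - \mathbf{a}^\top \mathbf{q}$

under constraints: $\mathbf{m}^\top \mathbf{q} = \frac{\mu}{2} + \frac{1}{2n} \sum_{i=1}^{n} \mathcal{M}_i$

and: $0 \le q_i \le \frac{1}{n} \quad \forall i \in \{1, \ldots, n\}$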
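To hand Program 1 to a generic QP solver, it helps to write it in the common form "minimize $\frac{1}{2}\mathbf{q}^\top P \mathbf{q} + \mathbf{c}^\top \mathbf{q}$ subject to $G\mathbf{q} \le \mathbf{h}$ and $A\mathbf{q} = \mathbf{b}$". The sketch below only assembles these arrays as plain Python lists; the function name `mincq_qp_data`, the dictionary keys and the toy inputs are assumptions made for illustration, and no particular solver API is implied.

```python
def mincq_qp_data(M, m_vec, mu):
    """Assemble Program 1 in the generic QP form
         minimize (1/2) q^T P q + c^T q   s.t.   G q <= h,   A q = b.
    M is the n x n matrix of second moments, m_vec the vector of first
    moments, and mu the margin hyperparameter.
    """
    n = len(m_vec)
    # a_j = (1/n) * sum_i M[i][j], as in the definition of the vector a.
    a = [sum(M[i][j] for i in range(n)) / n for j in range(n)]
    # (1/2) q^T (2M) q = q^T M q, hence the factor 2.
    P = [[2.0 * M[i][j] for j in range(n)] for i in range(n)]
    c = [-a_j for a_j in a]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Box constraints 0 <= q_i <= 1/n written as G q <= h.
    G = identity + [[-v for v in row] for row in identity]
    h = [1.0 / n] * n + [0.0] * n
    # Margin constraint m^T q = mu/2 + (1/(2n)) * sum_i M_i.
    A = [list(m_vec)]
    b = [mu / 2.0 + sum(m_vec) / (2.0 * n)]
    return {"P": P, "c": c, "G": G, "h": h, "A": A, "b": b}

# Hypothetical toy moments for n = 2 voters.
M = [[1.0, 0.2], [0.2, 1.0]]
m_vec = [0.6, 0.4]
qp = mincq_qp_data(M, m_vec, mu=0.1)
print(qp["b"], qp["h"])
```

Since $\mathbf{M}$ is positive semi-definite, so is $P = 2\mathbf{M}$, which is the condition most convex QP solvers require.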


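To prove that Program 1 is a convex quadratic program, it suffices to show that $\mathbf{M}$ is a positive
semi-definite matrix. This is a direct consequence of the fact that each $\mathcal{M}_{i,j}$ can be viewed
as a scalar product, since

\[
\mathcal{M}_{i,j} = \left( \sqrt{\tfrac{1}{|S|}}\, f_i(x) \right)_{x \in S_x} \cdot \left( \sqrt{\tfrac{1}{|S|}}\, f_j(x) \right)_{x \in S_x} ,
\quad\text{where } S_x = \{ x : (x, y) \in S \} .
\]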


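Finally, the $Q$-weighted majority vote output by MinCq is

\begin{align*}
B_Q(x)
&= \operatorname{sgn}\left[ \mathop{\mathbf{E}}_{f \sim Q} f(x) \right]
 = \operatorname{sgn}\left[ \sum_{i=1}^{2n} q_i f_i(x) \right] \\
&= \operatorname{sgn}\left[ \sum_{i=1}^{n} q_i f_i(x) + \sum_{i=1}^{n} q_{i+n} f_{i+n}(x) \right] \\
&= \operatorname{sgn}\left[ \sum_{i=1}^{n} q_i f_i(x) + \sum_{i=1}^{n} \left( \tfrac{1}{n} - q_i \right) \cdot \left( -f_i(x) \right) \right] \\
&= \operatorname{sgn}\left[ \sum_{i=1}^{n} \left( 2 q_i - \tfrac{1}{n} \right) f_i(x) \right] .
\end{align*}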
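The last expression means that prediction only needs the $n$ QP variables. As a small illustration, the sketch below (with hypothetical stumps and hand-picked weights) evaluates the vote both over the full self-complemented set and through the reduced $n$-term sum, and checks that they agree.

```python
def sgn(value):
    """Sign function with sgn(0) = 0."""
    return 1 if value > 0 else (-1 if value < 0 else 0)

def vote_full(q_full, voters_full, x):
    """Majority vote over all 2n voters of the self-complemented set."""
    return sgn(sum(w * f(x) for w, f in zip(q_full, voters_full)))

def vote_reduced(q, base_voters, x):
    """Same vote using only the n QP variables: sgn(sum (2 q_i - 1/n) f_i(x))."""
    n = len(base_voters)
    return sgn(sum((2.0 * w - 1.0 / n) * f(x) for w, f in zip(q, base_voters)))

# Hypothetical base voters (decision stumps) and their complements.
base_voters = [lambda x, t=t: 1 if x > t else -1 for t in (-0.5, 0.0, 0.5)]
complements = [lambda x, f=f: -f(x) for f in base_voters]

# Hand-picked weights in [0, 1/n]; the complements receive 1/n - q_i.
q = [0.30, 0.05, 0.20]
q_full = q + [1.0 / 3 - w for w in q]

points = [-0.9, -0.2, 0.3, 0.8]
predictions = [vote_reduced(q, base_voters, x) for x in points]
agree = all(vote_full(q_full, base_voters + complements, x) == p
            for x, p in zip(points, predictions))
print(predictions, agree)
```

Note that a weight $q_i$ below $\frac{1}{2n}$ gives a negative coefficient, which amounts to favoring the complement $-f_i$.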
8.3 Experiments

We now compare MinCq to state-of-the-art learning algorithms in three different contexts:
handwritten digits recognition, classical binary classification tasks, and Amazon reviews
sentiment analysis. A context (Lacoste et al., 2012) represents a distribution on the different
tasks a learning algorithm can encounter, and a sample from a context is a collection of
data sets.

For each context, each data set is randomly split into a training set S and a testing set T. 
When hyperparameters have to be chosen for an algorithm, 5-fold cross-validation is run on 
the training set S, and the hyperparameter values that minimize the mean cross-validation 
risk are chosen. Using these values, the algorithm is trained on the whole training set S, 
and then evaluated on the testing set T. 

For the first two contexts, we compare MinCq using decision stumps as voters (referred
to as StumpsMinCq), MinCq using RBF kernel functions $k(x, x') = \exp(-\gamma \|x - x'\|^2)$ as
voters (referred to as RbfMinCq), AdaBoost (Schapire and Singer, 1999) using decision
stumps (referred to as StumpsAdaBoost), and the soft-margin Support Vector Machine
(SVM) (Cortes and Vapnik, 1995) using the RBF kernel, referred to as RbfSVM. For the
last context, we compare MinCq using linear kernel functions $k(x, x') = x \cdot x'$ as voters
(referred to as LinearMinCq), and the SVM using the same linear kernel, referred to as
LinearSVM.

For the three variants of MinCq, the quadratic program is solved using CVXOPT (Dahl 
and Vandenberghe, 2007), an off-the-shelf convex optimization solver. 

StumpsAdaBoost: For StumpsAdaBoost, we use decision stumps as weak learners. For 
each attribute, 10 decision stumps (and their complement) are generated, for a total 
of 20 decision stumps per attribute. The number of boosting rounds is chosen among 
the following 15 values: 10, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 500, 750
and 1000.

StumpsMinCq: For StumpsMinCq, we use the same 10 decision stumps per attribute as
for StumpsAdaBoost. Note that we do not need to consider the complement stumps in
this case, as MinCq automatically considers self-complemented sets of voters. MinCq’s
hyperparameter $\mu$ is chosen among 15 values between $10^{-4}$ and $10^{0}$ on a logarithmic
scale.

RbfSVM: The $\gamma$ hyperparameter of the RBF kernel and the $C$ hyperparameter of the
SVM are chosen among 15 values between $10^{-4}$ and $10^{1}$ for $\gamma$, and among 15 values
between $10^{0}$ and $10^{8}$ for $C$, both on a logarithmic scale.

RbfMinCq: For RbfMinCq, we consider 15 values of $\mu$ between $10^{-4}$ and $10^{-2}$ on a
logarithmic scale, and the same 15 values of $\gamma$ as in SVM for the RBF kernel voters.

LinearSVM: When using the linear kernel, the $C$ parameter of the SVM is chosen among
15 values between $10^{-4}$ and $10^{2}$, on a logarithmic scale. All SVM experiments are
done using the implementation of Pedregosa et al. (2011).

LinearMinCq: For LinearMinCq, we consider 15 values of $\mu$ between $10^{-4}$ and $10^{-2}$ on a
logarithmic scale.
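The hyperparameter grids described above are all of the form "15 values between two powers of ten on a logarithmic scale". A minimal sketch of how such a grid can be generated follows; the helper name `log_grid` is hypothetical, and evenly spaced exponents with both endpoints included are an assumption about what "on a logarithmic scale" means here.

```python
def log_grid(low_exp, high_exp, count=15):
    """Return `count` values evenly spaced in log10 between 10**low_exp and 10**high_exp."""
    step = (high_exp - low_exp) / (count - 1)
    return [10.0 ** (low_exp + k * step) for k in range(count)]

# Example: a grid for the StumpsMinCq margin hyperparameter (1e-4 to 1e0).
mu_grid = log_grid(-4, 0)
print(len(mu_grid), mu_grid[0], mu_grid[-1])
```

Consecutive values in such a grid differ by a constant multiplicative factor rather than a constant increment.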
[Figure 8 image: two scatter plots. Left panel x-axis: risk $R_T(B_Q)$ for RbfMinCq. Right panel x-axis: risk $R_T(B_Q)$ for StumpsMinCq.]

Figure 8: Comparison of the risks on the testing set for each algorithm and each MNIST
binary data set. The figure on the left shows a comparison of the risks of RbfMinCq
(x-axis) and RbfSVM (y-axis). The figure on the right compares StumpsMinCq
(x-axis) and StumpsAdaBoost (y-axis). On each scatter plot, a point represents a
pair of risks for a particular MNIST binary data set. A point above the diagonal
line indicates better performance for MinCq.


Statistical Comparison Tests

                          RbfMinCq vs RbfSVM    StumpsMinCq vs StumpsAdaBoost
Poisson binomial test     88%                   99%
Sign test (p-value)       0.01                  0.00


Table 2: Statistical tests comparing MinCq to either RbfSVM or StumpsAdaBoost. The
Poisson binomial test gives the probability that MinCq has a better performance
than another algorithm on this context. The sign test gives a p-value representing
the probability that the null hypothesis is true (i.e., MinCq and the other algorithm
both have the same performance on this context).
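For readers unfamiliar with the sign test, the following sketch computes a one-sided p-value from win, loss and tie counts over a collection of data sets. The tie handling (each pair of ties adds one win, an odd leftover tie is counted as a loss) and the one-sided alternative are assumptions of this sketch; the text does not specify which variant was used, so the function need not reproduce the tabulated values.

```python
from math import comb

def one_sided_sign_test(wins, losses, ties=0):
    """P(X >= k) for X ~ Binomial(n, 1/2), with n = wins + losses + ties.

    Assumption: ties are split evenly between the two algorithms, rounding
    down, so k = wins + ties // 2.
    """
    n = wins + losses + ties
    k = wins + ties // 2
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

# Example: 3 wins, 0 losses and 1 tie over 4 data sets.
p_value = one_sided_sign_test(wins=3, losses=0, ties=1)
print(p_value)  # 0.3125
```

With so few data sets, even a clean sweep cannot give a small p-value: 4 wins out of 4 yields $1/16 = 0.0625$.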


When using the RBF kernel for the SVM or MinCq, each data set is normalized using a
hyperbolic tangent. For each example $x$, each attribute $x_1, x_2, \ldots, x_n$ is renormalized with
$x_i' = \tanh\left[ \frac{x_i - \bar{x}_i}{\sigma_i} \right]$, where $\bar{x}_i$ and $\sigma_i$ are the mean and standard deviation of the $i$-th attribute
respectively, calculated on the training set $S$. Normalizing the features when using the RBF
kernel is a common practice and gives better results for both MinCq and SVM. Empirically,
we observe that the performance gain of RbfMinCq with normalized data is even more
significant than for RbfSVM.
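The normalization above can be sketched as follows. The helper name `fit_tanh_normalizer` is hypothetical; using the population standard deviation and mapping constant attributes to 0 are assumptions of this sketch, since the text specifies neither.

```python
import math
import statistics

def fit_tanh_normalizer(train_rows):
    """Fit per-attribute mean and standard deviation on the training set and
    return a function mapping a row to tanh((x_i - mean_i) / std_i)."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # Assumption: population standard deviation (divide by N).
    stds = [statistics.pstdev(col) for col in columns]

    def transform(row):
        # Assumption: a constant attribute (std = 0) is mapped to 0.
        return [math.tanh((v - m) / s) if s > 0 else 0.0
                for v, m, s in zip(row, means, stds)]

    return transform

# Hypothetical training rows: the second attribute is constant.
train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
transform = fit_tanh_normalizer(train)
print(transform([2.0, 10.0]), transform([100.0, 10.0]))
```

The statistics are fitted on the training set only and then reused unchanged for test examples, so every transformed attribute lies in $[-1, 1]$ even for outliers.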
8.3.1 Handwritten Digits Recognition Context

The first context of interest to compare MinCq with other learning algorithms is
handwritten digits recognition. For this task, we use the MNIST database of handwritten digits
of Lecun and Cortes. We split the original data set into 45 binary classification tasks, where
the union of all binary data sets recovers the original data set, and the intersection of any
pair of binary data sets gives the empty set. Therefore, any example from the original data
set appears in one and only one binary data set, thus avoiding any correlation between the
binary data sets. For each resulting binary data set, we randomly choose 500 examples to
be in the training set S, and the testing set T consists of the remaining examples. Figure 8
shows the resulting test risk for each binary data set and each algorithm.

Table 2 shows two statistical tests to compare the algorithms on the handwritten
digits recognition context: the Poisson binomial test (Lacoste et al., 2012) and the sign
test (Mendenhall, 1983). Both methods suggest that RbfMinCq outperforms RbfSVM on
this context, and that StumpsMinCq outperforms StumpsAdaBoost.

8.3.2 Classical Binary Classification Tasks Context 

This second context of interest is a more general one: it consists of multiple binary
classification data sets coming from the UCI Machine Learning Repository (Blake and Merz,
1998). These data sets are commonly used as a benchmark for learning algorithms, and
may help to answer the question “How well may a learning algorithm perform on many
unrelated classification tasks?” For each data set, half of the examples (up to a maximum
of 500) are randomly chosen to be in the training set S, and the remaining examples are in
the testing set T. Table 3 shows the resulting test risks on this context, for each algorithm.

Table 3 also shows a statistical comparison of all algorithms on the classical binary 
classification tasks context, using the Poisson binomial test and the sign test. On this 
context, both statistical tests show no significant performance difference between RbfMinCq 
and RbfSVM, and between StumpsMinCq and StumpsAdaBoost, implying that these pairs 
of algorithms perform similarly well on this general context. 

8.3.3 Amazon Reviews Sentiment Analysis 

This context contains 4 sentiment analysis data sets, representing product types (books, 
DVDs, electronics and kitchen appliances). The task is to learn from an Amazon.com 
product user review in natural language, and predict the polarity of the review, that is 
either negative (3 stars or less) or positive (4 or 5 stars). The data sets come from Blitzer 
et al. (2007), where the natural language reviews have already been converted into a set 
of unigrams and bigrams of terms, with a count. For each data set, a training set of 1000
positive reviews and 1000 negative reviews is provided, and the remaining reviews are
available in a testing set. The original feature space of these data sets is between 90,000
and 200,000 dimensions. However, as most of the unigrams and bigrams are not significant
and to reduce the dimensionality, we only consider unigrams and bigrams that appear
at least 10 times on the training set (as in Chen et al., 2011), reducing the number of
dimensions to between 3500 and 6000. Again as in Chen et al. (2011), we apply standard
tf-idf feature re-weighting (Salton and Buckley, 1988). Table 4 shows the resulting test 
risks for each algorithm. 
Data Set Information              Risk R_T(B_Q) for Each Algorithm

Name            |S|    |T|      RbfMinCq   RbfSVM   StumpsMinCq   StumpsAdaBoost
Australian      345    345      0.142      0.133    0.165         0.168
Balance         313    312      0.054      0.042    0.042         0.032
Breast Cancer   350    349      0.037      0.046    0.037         0.060
Car             500    1228     0.074      0.032    0.320         0.291
Cmc             500    973      0.303      0.306    0.140         0.134
Credit-A        345    345      0.122      0.133    0.304         0.308
Cylinder        270    270      0.204      0.233    0.125         0.148
Ecoli           168    168      0.077      0.071    0.289         0.289
Flags           97     97       0.289      0.320    0.071         0.071
Glass           107    107      0.206      0.206    0.268         0.309
Heart           135    135      0.163      0.156    0.262         0.271
Hepatitis       78     77       0.169      0.143    0.185         0.185
Horse           184    184      0.185      0.196    0.169         0.221
Ionosphere      176    175      0.114      0.069    0.245         0.174
Letter:AB       500    1055     0.007      0.003    0.109         0.120
Letter:DO       500    1058     0.021      0.018    0.005         0.010
Letter:OQ       500    1036     0.023      0.036    0.020         0.048
Liver           173    172      0.267      0.285    0.042         0.052
Monks           216    216      0.245      0.208    0.306         0.236
Nursery         500    12459    0.025      0.026    0.025         0.026
Optdigits       500    3323     0.034      0.027    0.089         0.089
Pageblock       500    4973     0.045      0.048    0.059         0.055
Pendigits       500    6994     0.007      0.008    0.069         0.084
Pima            384    384      0.253      0.255    0.273         0.250
Segment         500    1810     0.017      0.018    0.040         0.022
Spambase        500    4101     0.067      0.077    0.133         0.070
Tic-tac-toe     479    479      0.033      0.025    0.330         0.353
US vote         218    217      0.051      0.051    0.051         0.051
Wine            89     89       0.034      0.045    0.169         0.034
Yeast           500    984      0.286      0.279    0.324         0.306
Zoo             51     50       0.040      0.060    0.060         0.040

Statistical Comparison Tests

                          RbfMinCq vs RbfSVM    StumpsMinCq vs StumpsAdaBoost
Poisson binomial test     54%                   48%
Sign test (p-value)       0.36                  0.35


Table 3: Risk on the testing set for all algorithms, on the classical binary classification task 
context. See Table 2 for an explanation of the statistical tests. 


Table 4 also shows a statistical comparison of the algorithms on this context, again using 
the Poisson binomial test and the sign test. LinearMinCq has an edge over LinearSVM, 
as it wins or draws on each data set. However, both statistical tests show no significant 
performance difference between LinearMinCq and LinearSVM. 

These experiments show that minimizing the C-bound, and thus favoring majority votes 
for which the voters are maximally uncorrelated, is a sound approach. MinCq is very 
competitive with both AdaBoost and the SVM on the classical binary tasks context and 
the Amazon reviews sentiment analysis context. MinCq even shows a highly significant 
performance gain on the handwritten digits recognition context, implying that on certain 
types of tasks or data sets, minimizing the C-bound offers a state-of-the-art performance. 
Data Set Information            Risk R_T(B_Q) for Each Algorithm

Name           |S|     |T|      LinearMinCq   LinearSVM
Books          2000    4465     0.158         0.158
DVD            2000    3586     0.162         0.163
Kitchen        2000    5945     0.130         0.131
Electronics    2000    5681     0.116         0.118

Statistical Comparison Tests

                          LinearMinCq vs LinearSVM
Poisson binomial test     68%
Sign test (p-value)       0.31


Table 4: Risk on the testing set for all algorithms, on the Amazon reviews sentiment analysis 
context. See Table 2 for an explanation of the statistical tests. 


However, for all of the above experiments, we observe that the empirical values of the
PAC-Bounds are trivial (close to 1). Remember that, inspired by PAC-Bounds 3 and 3’, the
MinCq algorithm learns the weights of a majority vote by minimizing the second moment
of the margin while fixing its first moment $\mu$ to some value. In these experiments, the value
of $\mu$ chosen by cross-validation is always very close to 0 (basically, $\mu = 10^{-4}$). This implies
that $\mathcal{C}_Q^S = 1 - \frac{\mu^2}{\mu_2(M_Q^S)}$ is very close to the $1 - \frac{0}{0}$ form, leading to a severe degradation of
PAC-Bayesian bounds for $\mathcal{C}_Q^D$. Note that the voters were all weak in the former experiments.
This explains why very small values of $\mu$ were selected by cross-validation.
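The degradation described above can be illustrated numerically. The sketch below evaluates the C-bound $1 - \mu_1^2 / \mu_2$ from given first and second moments (the function name `c_bound` and the numbers are hypothetical): when the first moment is about $10^{-4}$, a tiny absolute error on the second moment is enough to move the value from far below 1 to almost 1.

```python
def c_bound(mu1, mu2):
    """C-bound 1 - mu1^2 / mu2, defined for a positive first moment mu1.

    The second moment always satisfies mu2 >= mu1^2, so the value lies in [0, 1).
    """
    if mu1 <= 0.0:
        raise ValueError("the C-bound requires a positive first moment")
    if mu2 < mu1 ** 2:
        raise ValueError("inconsistent moments: mu2 must be at least mu1^2")
    return 1.0 - mu1 ** 2 / mu2

# Stronger margin: the bound is insensitive to a small error on mu2.
strong = c_bound(0.5, 0.5)
strong_perturbed = c_bound(0.5, 0.5 + 1e-6)

# Margin close to 0: the same absolute error on mu2 changes everything.
weak = c_bound(1e-4, 2e-8)
weak_perturbed = c_bound(1e-4, 2e-8 + 1e-6)

print(strong, strong_perturbed)
print(weak, weak_perturbed)
```

Any confidence interval on the two moments therefore translates into an almost vacuous interval on the bound when $\mu$ is that small, which is consistent with the trivial PAC-Bound values reported above.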


8.3.4 Experiments with Stronger Voters 

In the following experiment, we show that one can obtain much better bound values by using
stronger voters, that is, voters with a better individual performance. To do so, instead of
considering decision stumps, we consider decision trees.¹⁷ We use 100 decision tree classifiers
generated with the implementation of Pedregosa et al. (2011) (we set the maximum depth
to 10 and the number of features per node to 1). By using these strong voters, it is possible
to achieve higher values of $\mu$.¹⁸

Figure 9 shows the empirical C-bound value and its corresponding PAC-Bayesian bound
values for multiple values of $\mu$ on the Mushroom UCI data set. From the 8124 examples, 500
have been used to construct the set of voters, 4062 for the training set, and the remaining
examples for the testing set. The figure shows that the PAC-Bayesian bounds get tighter as
$\mu$ increases. Note however that the empirical C-bound slightly increases from 0.001 to
0.016. The risk on the testing set of the majority vote (not shown in the figure) is 0 for
most values of $\mu$, but also increases a bit for the highest values (remaining below 0.001).


17. A decision stump can be seen as a (weak) decision tree of depth 1. 

18. Note that the set of decision trees was learned on a fresh set of examples, disjoint from the training data. 
We do so to ensure that all computed PAC-Bounds are valid, even if they are not designed to handle 
sample-compressed voters. 





Germain, Lacasse, Laviolette, Marchand and Roy 



Figure 9: Values of empirical C-bound and corresponding PAC-Bounds 0, 1, 2, 2’ and 3 on
the majority votes output by MinCq, for multiple values of $\mu$.


Hence, we obtain tight bounds for high values of $\mu$ (PAC-Bounds 2 and 2’ are under
0.2). Nevertheless, these PAC-Bayesian bounds are not tight enough to precisely guide the
selection of $\mu$. This is why we rely on cross-validation to select a good value of $\mu$.

Finally, we also see that PAC-Bound 3 is looser than other bounds over $\mathcal{C}_Q^D$, but this was
expected as it was not designed to be as tight as possible. That being said, PAC-Bound 3
has the same behavior as PAC-Bounds 1 and 2. This suggests that we can rely on it to
justify the MinCq learning algorithm once the hyperparameter is fixed.

9. Conclusion 

In this paper, we have revisited the work presented in Lacasse et al. (2006) and Laviolette 
et al. (2011). We clarified the presentation of previous results and extended them, and we
updated the discussion in light of the ever-growing development of PAC-Bayesian theory.

We have derived a risk bound (called the C-bound) for the weighted majority vote 
that depends on the first and the second moment of the associated margin distribution 
(Theorem 11). The proposed bound is based on the one-sided Chebyshev inequality, which, 
under the mild condition of Proposition 14, is the tightest inequality for any real-valued 
random variable given only its first two moments. Also, as shown empirically by Figure 3, 
this bound has a strong predictive power on the risk of the majority vote. 

We have also shown that the original PAC-Bayesian theorem, together with new ones,
can be used to obtain high-confidence estimates of this new risk bound that hold uniformly
for all posterior distributions. We have generalized these PAC-Bayesian results to the (more
general) sample compression setting, allowing one to make use of voters that are constructed
with elements of the training data, such as kernel functions $y_i k(x_i, \cdot)$. Moreover, we have
presented PAC-Bayesian bounds that have the uncommon property of having no
Kullback-Leibler divergence term (PAC-Bounds 3 and 3’). These bounds, together with the C-bound,
gave the theoretical foundation to the learning algorithm introduced at the end of the
paper, that we have called MinCq. The latter turns out to be expressible in the nice form
of a quadratic program. MinCq is not only based on solid theoretical guarantees, it also
performs very well on natural data, namely when compared with the state-of-the-art SVM.

This work tackled the simplest problem in machine learning (supervised binary
classification in the presence of i.i.d. data), and we now consider that the PAC-Bayesian theory is
mature enough to embrace a variety of more sophisticated frameworks. Indeed, in recent
years several authors applied this theory to many more complex paradigms: Transductive
Learning (Derbeko et al., 2004; Catoni, 2007; Begin et al., 2014), Domain Adaptation
(Germain et al., 2013), Density Estimation (Seldin and Tishby, 2009; Higgs and Shawe-Taylor,
2010), Structured Output Prediction (McAllester, 2007; Giguere et al., 2013; London et al.,
2014), Co-clustering (Seldin and Tishby, 2009, 2010), Martingales (Seldin et al., 2012),
U-Statistics of higher order (Lever et al., 2013) or other non-i.i.d. settings (Ralaivola et al.,
2010), Multi-armed Bandit (Seldin et al., 2011) and Reinforcement Learning (Fard and
Pineau, 2010; Fard et al., 2011).
Appendix A. Auxiliary mathematical results

Lemma 46 (Markov’s inequality) For any random variable $X$ such that $\mathbf{E}(|X|) = \mu$,
and for any $a > 0$, we have

\[
\Pr\left( |X| \ge a \right) \le \frac{\mu}{a} .
\]

Lemma 47 (Jensen’s inequality) For any random variable $X$ and any convex function $f$,
we have

\[
f\left( \mathbf{E}[X] \right) \le \mathbf{E}\left[ f(X) \right] .
\]
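Lemma 48 (One-sided Chebyshev inequality) For any random variable $X$ such that
$\mathbf{E}(X) = \mu$ and $\operatorname{Var}(X) = \sigma^2$, and for any $a > 0$, we have

\[
\Pr\left( X - \mu \ge a \right) \le \frac{\sigma^2}{\sigma^2 + a^2} .
\]

Proof First observe that $\Pr\left( X - \mu \ge a \right) \le \Pr\left( \left[ X - \mu + \frac{\sigma^2}{a} \right]^2 \ge \left[ a + \frac{\sigma^2}{a} \right]^2 \right)$. Let us now
apply Markov’s inequality (Lemma 46) to bound this probability. We obtain

\begin{align*}
\Pr\left( \left[ X - \mu + \tfrac{\sigma^2}{a} \right]^2 \ge \left[ a + \tfrac{\sigma^2}{a} \right]^2 \right)
&\le \frac{\mathbf{E}\left[ X - \mu + \frac{\sigma^2}{a} \right]^2}{\left[ a + \frac{\sigma^2}{a} \right]^2}
 && \text{(Markov’s inequality)} \\
&= \frac{\mathbf{E}(X - \mu)^2 + 2\left( \frac{\sigma^2}{a} \right) \mathbf{E}(X - \mu) + \left( \frac{\sigma^2}{a} \right)^2}{\left[ a + \frac{\sigma^2}{a} \right]^2} \\
&= \frac{\sigma^2 + \left( \frac{\sigma^2}{a} \right)^2}{\left[ a + \frac{\sigma^2}{a} \right]^2}
 = \frac{\sigma^2 \left( 1 + \frac{\sigma^2}{a^2} \right)}{\left( \sigma^2 + a^2 \right) \left( 1 + \frac{\sigma^2}{a^2} \right)}
 = \frac{\sigma^2}{\sigma^2 + a^2} ,
\end{align*}

because $\mathbf{E}(X - \mu)^2 = \operatorname{Var}(X) = \sigma^2$ and $\mathbf{E}(X - \mu) = \mathbf{E}(X) - \mathbf{E}(X) = 0$. ■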
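As a quick sanity check of Lemma 48, the sketch below computes the exact tail probability and the one-sided Chebyshev bound for a finite discrete distribution. The function name `one_sided_chebyshev` and the example distribution are hypothetical; the symmetric two-point example is chosen because the bound is attained there.

```python
def one_sided_chebyshev(values, probs, a):
    """Return (Pr(X - mu >= a), sigma^2 / (sigma^2 + a^2)) for a discrete X."""
    mu = sum(p * v for v, p in zip(values, probs))
    var = sum(p * (v - mu) ** 2 for v, p in zip(values, probs))
    tail = sum(p for v, p in zip(values, probs) if v - mu >= a)
    bound = var / (var + a * a)
    return tail, bound

# X takes values -1 and +1 with probability 1/2 each: mu = 0, sigma^2 = 1.
tail, bound = one_sided_chebyshev([-1.0, 1.0], [0.5, 0.5], a=1.0)
print(tail, bound)  # 0.5 0.5
```

The equality in this example shows that the inequality cannot be improved using only the first two moments.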


Note that the proof of Theorem 49 (below) by Cover and Thomas (1991) considers that
probability distributions $Q$ and $P$ are discrete, but their argument is straightforwardly
generalizable to continuous distributions.

Theorem 49 (Cover and Thomas, 1991, Theorem 2.7.2) The Kullback-Leibler divergence
$\mathrm{KL}(Q \| P)$ is convex in the pair $(Q, P)$, i.e., if $(Q_1, P_1)$ and $(Q_2, P_2)$ are two pairs of
probability distributions, then

\[
\mathrm{KL}\left( \lambda Q_1 + (1 - \lambda) Q_2 \,\|\, \lambda P_1 + (1 - \lambda) P_2 \right)
\le \lambda\, \mathrm{KL}(Q_1 \| P_1) + (1 - \lambda)\, \mathrm{KL}(Q_2 \| P_2) ,
\]

for all $\lambda \in [0, 1]$.

Corollary 50 The following two functions are convex:

1. The function $\mathrm{kl}(q \| p)$ of Equation (21), i.e., the Kullback-Leibler divergence between
two Bernoulli distributions;

2. The function $\mathrm{kl}(q_1, q_2 \| p_1, p_2)$ of Equation (31), i.e., the Kullback-Leibler divergence
between two distributions of trivalent random variables.

Proof Straightforward consequence of Theorem 49. ■
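The first item of Corollary 50 can be probed numerically. The sketch below draws random pairs of Bernoulli parameters and mixing weights and counts how often the joint convexity inequality fails beyond a small floating-point tolerance. The sampling ranges, the tolerance and the names are assumptions of this sketch, and such a check illustrates the statement without proving it.

```python
import math
import random

def kl_bernoulli(q, p):
    """kl(q||p) between Bernoulli(q) and Bernoulli(p), for 0 < p < 1."""
    def term(a, b):
        # Convention 0 * ln(0 / b) = 0.
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(q, p) + term(1.0 - q, 1.0 - p)

random.seed(1)
violations = 0
for _ in range(10000):
    q1, q2, p1, p2 = (random.uniform(0.01, 0.99) for _ in range(4))
    lam = random.random()
    mixed = kl_bernoulli(lam * q1 + (1 - lam) * q2,
                         lam * p1 + (1 - lam) * p2)
    bound = lam * kl_bernoulli(q1, p1) + (1 - lam) * kl_bernoulli(q2, p2)
    if mixed > bound + 1e-12:
        violations += 1

print(violations)
```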
Lemma 51 (Maurer, 2004) Let $X$ be any random variable with values in $[0, 1]$ and
expectation $\mu = \mathbf{E}(X)$. Denote $\mathbf{X}$ the vector containing the results of $n$ independent realizations
of $X$. Then, consider a Bernoulli random variable $X'$ ($\{0, 1\}$-valued) of probability of
success $\mu$, i.e., $\Pr(X' = 1) = \mu$. Denote $\mathbf{X}' \in \{0, 1\}^n$ the vector containing the results of $n$
independent realizations of $X'$.

If function $f : [0, 1]^n \to \mathbb{R}$ is convex, then

\[
\mathbf{E}\left[ f(\mathbf{X}) \right] \le \mathbf{E}\left[ f(\mathbf{X}') \right] .
\]


The proof of Lemma 52 (below) follows the key steps of the proof of Lemma 51 by Maurer
(2004), but we include a few more mathematical details for completeness. Interestingly, the
proof highlights that one can generalize Maurer’s lemma even more, to embrace random
variables of any (countable) number of possible outputs. Note that another generalization
of Maurer’s lemma is given in Seldin et al. (2012) to embrace the case where the random
variables $X_1, \ldots, X_n$ are a martingale sequence instead of being independent.

Lemma 52 (Generalization of Lemma 51) Let the tuple $(X, Y)$ be a random variable
with values in $[0, 1]^2$, such that $X + Y \le 1$, and with expectation $(\mu_X, \mu_Y) = (\mathbf{E}(X), \mathbf{E}(Y))$.
Given $n$ independent realizations of $(X, Y)$, denote $\mathbf{X} = (X_1, \ldots, X_n)$ the vector of
corresponding $X$-values and $\mathbf{Y} = (Y_1, \ldots, Y_n)$ the vector of corresponding $Y$-values. Then,
consider a random variable $(X', Y')$ with three possible outcomes, $(1, 0)$, $(0, 1)$ and $(0, 0)$, of
probabilities $\mu_X$, $\mu_Y$ and $1 - \mu_X - \mu_Y$, respectively. Denote $\mathbf{X}', \mathbf{Y}' \in \{0, 1\}^n$ the vectors of $n$
independent realizations of $(X', Y')$.

If a function $f : [0, 1]^n \times [0, 1]^n \to \mathbb{R}$ is convex, then

\[
\mathbf{E}\left[ f(\mathbf{X}, \mathbf{Y}) \right] \le \mathbf{E}\left[ f(\mathbf{X}', \mathbf{Y}') \right] .
\]

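Proof Given two vectors $\mathbf{x} = (x_1, \ldots, x_n), \mathbf{y} = (y_1, \ldots, y_n) \in [0, 1]^n$, let us define

\[
(\mathbf{x}, \mathbf{y}) \overset{\text{def}}{=} \big( (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \big) \in ([0, 1] \times [0, 1])^n .
\]

Consider $H = \{(1, 0), (0, 1), (0, 0)\}$. Lemma 53 (below) shows that any point $(\mathbf{x}, \mathbf{y})$ can be
written as a convex combination of the extreme points $\eta = (\eta_1, \eta_2, \ldots, \eta_n) \in H^n$:

\[
(\mathbf{x}, \mathbf{y}) = \sum_{\eta \in H^n}
\left[ \left( \prod_{i : \eta_i = (1,0)} x_i \right)
\left( \prod_{i : \eta_i = (0,1)} y_i \right)
\left( \prod_{i : \eta_i = (0,0)} 1 - x_i - y_i \right) \right] \cdot \eta .
\tag{45}
\]

Convexity of function $f$ implies

\[
f(\mathbf{x}, \mathbf{y}) \le \sum_{\eta \in H^n}
\left[ \left( \prod_{i : \eta_i = (1,0)} x_i \right)
\left( \prod_{i : \eta_i = (0,1)} y_i \right)
\left( \prod_{i : \eta_i = (0,0)} 1 - x_i - y_i \right) \right] \cdot f(\eta) ,
\tag{46}
\]

with equality if $(\mathbf{x}, \mathbf{y}) \in H^n = \{(1, 0), (0, 1), (0, 0)\}^n$, because the elements of the sum are
$0 \cdot f(\eta)$ for all $\eta \in H^n \setminus \{(\mathbf{x}, \mathbf{y})\}$ and $1 \cdot f(\eta)$ only for $\eta = (\mathbf{x}, \mathbf{y})$.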
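Given that realizations of random variable $(X, Y)$ are independent and that for a given
$\eta_i \in H$, only one of the three products is computed¹⁹, we get

\begin{align*}
\mathbf{E}\left[ f(\mathbf{X}, \mathbf{Y}) \right]
&\le \mathbf{E}\left[ \sum_{\eta \in H^n}
 \left( \prod_{i : \eta_i = (1,0)} X_i \right)
 \left( \prod_{i : \eta_i = (0,1)} Y_i \right)
 \left( \prod_{i : \eta_i = (0,0)} 1 - X_i - Y_i \right) \cdot f(\eta) \right] \\
&= \sum_{\eta \in H^n} \mathbf{E}\left[
 \left( \prod_{i : \eta_i = (1,0)} X_i \right)
 \left( \prod_{i : \eta_i = (0,1)} Y_i \right)
 \left( \prod_{i : \eta_i = (0,0)} 1 - X_i - Y_i \right) \right] \cdot f(\eta) \\
&= \sum_{\eta \in H^n}
 \left( \prod_{i : \eta_i = (1,0)} \mathbf{E}(X_i) \right)
 \left( \prod_{i : \eta_i = (0,1)} \mathbf{E}(Y_i) \right)
 \left( \prod_{i : \eta_i = (0,0)} 1 - \mathbf{E}(X_i) - \mathbf{E}(Y_i) \right) \cdot f(\eta) \\
&= \sum_{\eta \in H^n}
 \left( \prod_{i : \eta_i = (1,0)} \mu_X \right)
 \left( \prod_{i : \eta_i = (0,1)} \mu_Y \right)
 \left( \prod_{i : \eta_i = (0,0)} 1 - \mu_X - \mu_Y \right) \cdot f(\eta) .
\end{align*}

This becomes an equality when $(\mathbf{X}, \mathbf{Y})$ takes values in $H^n$ (as we explain after Equation 46).
We therefore conclude that $\mathbf{E}\left[ f(\mathbf{X}, \mathbf{Y}) \right] \le \mathbf{E}\left[ f(\mathbf{X}', \mathbf{Y}') \right]$. ■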


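Lemma 53 (Proof of Equation 45) Consider $H = \{(1, 0), (0, 1), (0, 0)\}$ and an integer
$n > 0$. Any point $(\mathbf{x}, \mathbf{y}) \in ([0, 1] \times [0, 1])^n$ satisfying $x_i + y_i \le 1$ for all $i$ can be written as a
convex combination of the extreme points $\eta = (\eta_1, \eta_2, \ldots, \eta_n) \in H^n$:

\[
(\mathbf{x}, \mathbf{y}) = \sum_{\eta \in H^n} \rho_\eta(\mathbf{x}, \mathbf{y}) \cdot \eta ,
\]

where

\[
\rho_\eta(\mathbf{x}, \mathbf{y}) \overset{\text{def}}{=}
\left( \prod_{i : \eta_i = (1,0)} x_i \right)
\left( \prod_{i : \eta_i = (0,1)} y_i \right)
\left( \prod_{i : \eta_i = (0,0)} 1 - x_i - y_i \right) .
\]

Proof We prove the result by induction over vector size $n$.

Proof for $n = 1$:

\[
\sum_{\eta_1 \in H} \rho_{(\eta_1)}\big( (x_1, y_1) \big) \cdot (\eta_1)
= x_1 \cdot \big( (1, 0) \big) + y_1 \cdot \big( (0, 1) \big) + (1 - x_1 - y_1) \cdot \big( (0, 0) \big)
= \big( (x_1, y_1) \big) .
\]

Proof for $n > 1$: We suppose that the result is true for any vector $(\mathbf{x}, \mathbf{y})$ of a particular size
$n$ (this is our induction hypothesis) and we prove that it implies

\[
\sum_{(\eta, \eta_{n+1}) \in H^{n+1}} \rho_{(\eta, \eta_{n+1})}\big( (\mathbf{x}, \mathbf{y}), (x_{n+1}, y_{n+1}) \big) \cdot (\eta, \eta_{n+1})
= \big( (\mathbf{x}, \mathbf{y}), (x_{n+1}, y_{n+1}) \big) ,
\]

where $(\mathbf{a}, b)$ denotes a vector $\mathbf{a}$, augmented by one element $b$.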

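19. The equality between the second and third lines follows from the fact that each expectation inside the
sum of Line 2 can be rewritten as the following product of independent random variables:

\[
\mathbf{E}\left[ \prod_{i=1}^{n} g_{\eta_i}(X_i, Y_i) \right] ,
\quad\text{with}\quad
g_{\eta_i}(X_i, Y_i) =
\begin{cases}
X_i & \text{if } \eta_i = (1, 0) \\
Y_i & \text{if } \eta_i = (0, 1) \\
1 - X_i - Y_i & \text{otherwise.}
\end{cases}
\]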
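We have

\begin{align*}
&\sum_{(\eta, \eta_{n+1}) \in H^{n+1}} \rho_{(\eta, \eta_{n+1})}\big( (\mathbf{x}, \mathbf{y}), (x_{n+1}, y_{n+1}) \big) \cdot (\eta, \eta_{n+1}) \\
&\quad= \sum_{\eta \in H^n} \rho_\eta(\mathbf{x}, \mathbf{y}) \cdot x_{n+1} \cdot \big( \eta, (1, 0) \big)
 + \sum_{\eta \in H^n} \rho_\eta(\mathbf{x}, \mathbf{y}) \cdot y_{n+1} \cdot \big( \eta, (0, 1) \big) \\
&\qquad+ \sum_{\eta \in H^n} \rho_\eta(\mathbf{x}, \mathbf{y}) \cdot (1 - x_{n+1} - y_{n+1}) \cdot \big( \eta, (0, 0) \big) \\
&\quad= \left( \sum_{\eta \in H^n} \rho_\eta(\mathbf{x}, \mathbf{y}) \cdot (x_{n+1} + y_{n+1} + 1 - x_{n+1} - y_{n+1}) \cdot \eta ,\
 \sum_{\eta \in H^n} \rho_\eta(\mathbf{x}, \mathbf{y}) \cdot (x_{n+1}, y_{n+1}) \right) \\
&\quad= \left( \sum_{\eta \in H^n} \rho_\eta(\mathbf{x}, \mathbf{y}) \cdot \eta ,\
 \sum_{\eta \in H^n} \rho_\eta(\mathbf{x}, \mathbf{y}) \cdot (x_{n+1}, y_{n+1}) \right) \\
&\quad= \big( (\mathbf{x}, \mathbf{y}), (x_{n+1}, y_{n+1}) \big) .
\end{align*}

For the last equality, the $(\mathbf{x}, \mathbf{y})$ term of the vector above is obtained from the induction
hypothesis and the last couple is a direct consequence of the following equality:

\[
\sum_{\eta \in H^n} \rho_\eta(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{n} \left[ x_i + y_i + 1 - x_i - y_i \right] = 1 . \qquad ■
\]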
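The decomposition of Lemma 53 can be verified by brute force for small $n$. The sketch below enumerates all $3^n$ extreme points, accumulates the weighted sum, and returns both the total weight and the reconstructed point; the function names and the example coordinates are hypothetical, and the inputs are assumed to satisfy $x_i + y_i \le 1$.

```python
from itertools import product

H = [(1, 0), (0, 1), (0, 0)]

def rho(eta, xs, ys):
    """Weight of the extreme point eta for the point (xs, ys)."""
    weight = 1.0
    for e, x, y in zip(eta, xs, ys):
        if e == (1, 0):
            weight *= x
        elif e == (0, 1):
            weight *= y
        else:
            weight *= 1.0 - x - y
    return weight

def convex_combination(xs, ys):
    """Return (sum of weights, weighted sum of extreme points) over H^n."""
    n = len(xs)
    total = 0.0
    point = [[0.0, 0.0] for _ in range(n)]
    for eta in product(H, repeat=n):
        w = rho(eta, xs, ys)
        total += w
        for i, (ex, ey) in enumerate(eta):
            point[i][0] += w * ex
            point[i][1] += w * ey
    return total, point

# Hypothetical point with x_i + y_i <= 1 in every coordinate.
xs = [0.2, 0.5, 0.1]
ys = [0.3, 0.5, 0.0]
total, point = convex_combination(xs, ys)
print(total, point)
```

The weights sum to 1 and the weighted sum recovers each pair $(x_i, y_i)$, up to floating-point rounding.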


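Proposition 54 (Concavity of Equation 36) The function $F_C(d, e)$ is concave.

Proof We show that the Hessian matrix of $F_C(d, e)$ is a negative semi-definite matrix. In
other words, we need to prove that

\[
\frac{\partial^2 F_C(d, e)}{\partial d^2} \le 0 , \qquad
\frac{\partial^2 F_C(d, e)}{\partial e^2} \le 0 , \qquad\text{and}\qquad
\frac{\partial^2 F_C(d, e)}{\partial d^2} \frac{\partial^2 F_C(d, e)}{\partial e^2}
- \left( \frac{\partial^2 F_C(d, e)}{\partial d\, \partial e} \right)^2 \ge 0 .
\]

Indeed, we have

\begin{align*}
\frac{\partial^2 F_C(d, e)}{\partial d^2} &= \frac{2 (1 - 4e)^2}{(2d - 1)^3} \le 0
 \qquad \forall e \in [0, 1],\ d \in \left[ 0, \tfrac{1}{2} \right) , \\
\frac{\partial^2 F_C(d, e)}{\partial e^2} &= \frac{8}{2d - 1} < 0
 \qquad \forall e \in [0, 1],\ d \in \left[ 0, \tfrac{1}{2} \right) , \\
\frac{\partial^2 F_C(d, e)}{\partial d^2} \frac{\partial^2 F_C(d, e)}{\partial e^2}
- \left( \frac{\partial^2 F_C(d, e)}{\partial d\, \partial e} \right)^2
&= \frac{2 (1 - 4e)^2}{(2d - 1)^3} \cdot \frac{8}{2d - 1} - \left( \frac{4 - 16e}{(1 - 2d)^2} \right)^2 = 0 . \qquad ■
\end{align*}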
Appendix B. A General PAC-Bayesian Theorem for Tuples of Voters and
Aligned Posteriors

This section presents a change of measure inequality that generalizes both Lemmas 30
and 34, and a PAC-Bayesian theorem that generalizes both Theorems 31 and 35. As these
generalizations require more complex notation and ideas, they are provided as an appendix and
the simpler versions of the main paper have separate proofs.

Let $\mathcal{H}$ be a countable self-complemented set of real-valued functions. In the general setting,
we recall that $\mathcal{H}$ is self-complemented if there exists a bijection $c : \mathcal{H} \to \mathcal{H}$ such that
$c(f) = -f$ for any $f \in \mathcal{H}$. Moreover, for a distribution $Q$ aligned on a prior distribution $P$
and for any $f \in \mathcal{H}$, we have

\[
Q(f) + Q(c(f)) = P(f) + P(c(f)) .
\]

First, we need to define the following notation. Let $\mathbf{k}$ be a sequence of length $k$, containing
numbers representing indices of voters. Let $f_{\mathbf{k}} : \mathcal{X} \to \overline{\mathcal{Y}}^k$ be a function that outputs a tuple
of votes, such that $f_{\mathbf{k}}(x) \overset{\text{def}}{=} \left( f_{k_1}(x), \ldots, f_{k_k}(x) \right)$.

Let us recall that $P^k$ and $Q^k$ are Cartesian products of probability distributions $P$
and $Q$. Thus, the probability of drawing $f_{\mathbf{k}} \sim Q^k$ is given by

\[
Q^k(f_{\mathbf{k}}) \overset{\text{def}}{=} Q(f_{k_1}) \cdot Q(f_{k_2}) \cdot \ldots \cdot Q(f_{k_k}) = \prod_{i=1}^{k} Q(f_{k_i}) .
\]

Finally, for each $f_{\mathbf{k}}$ and each $j \in \{0, \ldots, 2^k - 1\}$, let

\[
f_{\mathbf{k}}^{[j]}(x) \overset{\text{def}}{=} \left( f_{k_1}^{(s_1^j)}(x), \ldots, f_{k_k}^{(s_k^j)}(x) \right) ,
\]

where $s_1^j s_2^j \ldots s_k^j$ is the binary representation of the number $j$, and where $f^{(0)} = f$ and
$f^{(1)} = c(f)$. Note that $f_{\mathbf{k}}^{[0]} = f_{\mathbf{k}}$.

To prove the next PAC-Bayesian theorem, we make use of the following change of
measure inequality.


Theorem 55 (Change of measure inequality for tuples of voters and aligned
posteriors) For any self-complemented set $\mathcal{H}$, for any distribution $P$ on $\mathcal{H}$, for any distribution
$Q$ aligned on $P$, and for any measurable function $\phi : \mathcal{H}^k \to \mathbb{R}$ for which $\phi(f_{\mathbf{k}}^{[j]}) = \phi(f_{\mathbf{k}}^{[j']})$
for any $j, j' \in \{0, \ldots, 2^k - 1\}$, we have

\[
\mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim Q^k} \phi(f_{\mathbf{k}}) \le \ln\left( \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim P^k} e^{\phi(f_{\mathbf{k}})} \right) .
\]
/k~Q fc V/k~f’ fc 
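Proof First, note that one can change the expectation over $Q^k$ to an expectation over $P^k$,
using the fact that $\phi(f_{\mathbf{k}}^{[j]}) = \phi(f_{\mathbf{k}}^{[j']})$ for any $j, j' \in \{0, \ldots, 2^k - 1\}$ and that $Q$ is aligned
on $P$.

\begin{align*}
2^k \cdot \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim Q^k} \phi(f_{\mathbf{k}})
&= \int_{\mathcal{H}^k} df_{\mathbf{k}}\, Q^k(f_{\mathbf{k}}^{[0]})\, \phi(f_{\mathbf{k}}^{[0]})
 + \int_{\mathcal{H}^k} df_{\mathbf{k}}\, Q^k(f_{\mathbf{k}}^{[1]})\, \phi(f_{\mathbf{k}}^{[1]})
 + \cdots
 + \int_{\mathcal{H}^k} df_{\mathbf{k}}\, Q^k(f_{\mathbf{k}}^{[2^k - 1]})\, \phi(f_{\mathbf{k}}^{[2^k - 1]}) \\
&= \int_{\mathcal{H}^k} df_{\mathbf{k}}\, Q^k(f_{\mathbf{k}}^{[0]})\, \phi(f_{\mathbf{k}})
 + \int_{\mathcal{H}^k} df_{\mathbf{k}}\, Q^k(f_{\mathbf{k}}^{[1]})\, \phi(f_{\mathbf{k}})
 + \cdots
 + \int_{\mathcal{H}^k} df_{\mathbf{k}}\, Q^k(f_{\mathbf{k}}^{[2^k - 1]})\, \phi(f_{\mathbf{k}}) \\
&= \int_{\mathcal{H}^k} df_{\mathbf{k}} \sum_{j=0}^{2^k - 1} Q^k(f_{\mathbf{k}}^{[j]})\, \phi(f_{\mathbf{k}}) \\
&= \int_{\mathcal{H}^k} df_{\mathbf{k}} \sum_{j=0}^{2^k - 1} \left( \prod_{i=1}^{k} Q\big( f_{k_i}^{(s_i^j)} \big) \right) \phi(f_{\mathbf{k}}) \tag{47} \\
&= \int_{\mathcal{H}^k} df_{\mathbf{k}} \prod_{i=1}^{k} \left[ Q(f_{k_i}) + Q(c(f_{k_i})) \right] \phi(f_{\mathbf{k}}) \tag{48} \\
&= \int_{\mathcal{H}^k} df_{\mathbf{k}} \prod_{i=1}^{k} \left[ P(f_{k_i}) + P(c(f_{k_i})) \right] \phi(f_{\mathbf{k}}) \\
&\;\;\vdots \\
&= 2^k \cdot \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim P^k} \phi(f_{\mathbf{k}}) ,
\end{align*}

where we obtain Line (48) from Line (47) by developing the terms of the product of Line (48).

The result is obtained by changing the expectation over $Q^k$ to an expectation over $P^k$,
and then by applying Jensen’s inequality (Lemma 47, in Appendix A).

\[
\mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim Q^k} \phi(f_{\mathbf{k}})
= \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim P^k} \phi(f_{\mathbf{k}})
= \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim P^k} \ln e^{\phi(f_{\mathbf{k}})}
\le \ln\left( \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim P^k} e^{\phi(f_{\mathbf{k}})} \right) . \qquad ■
\]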
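Theorem 56 (General PAC-Bayesian theorem for tuples of voters and aligned
posteriors) For any distribution $D$ on $\mathcal{X} \times \mathcal{Y}$, any self-complemented set $\mathcal{H}$ of voters
$\mathcal{X} \to \mathcal{Y}$, any prior distribution $P$ on $\mathcal{H}$, any integer $k \ge 1$, any convex function $\mathcal{D} :
[0, 1] \times [0, 1] \to \mathbb{R}$ and loss function $\mathcal{L} : \mathcal{Y}^k \times \mathcal{Y}^k \to [0, 1]$ for which
$\mathcal{D}\big( \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}^{[j]}), \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}^{[j]}) \big) = \mathcal{D}\big( \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}^{[j']}), \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}^{[j']}) \big)$ for any $j, j' \in \{0, \ldots, 2^k - 1\}$,
for any $m' > 0$ and any $\delta \in (0, 1]$, we have

\[
\Pr_{S \sim D^m} \left(
\begin{array}{l}
\text{For all posteriors } Q \text{ aligned on } P : \\[2pt]
\mathcal{D}\Big( \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim Q^k} \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}),\ \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim Q^k} \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}) \Big)
\le \dfrac{1}{m'} \left[ \ln\left( \dfrac{1}{\delta} \mathop{\mathbf{E}}_{S \sim D^m} \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim P^k} e^{m' \cdot \mathcal{D}\left( \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}),\, \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}) \right)} \right) \right]
\end{array}
\right) \ge 1 - \delta .
\]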

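Proof This proof follows most of the steps of Theorem 18.

We have that $\mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim P^k} e^{m' \cdot \mathcal{D}\left( \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}),\, \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}) \right)}$ is a non-negative random variable. By Markov’s
inequality, we have

\[
\Pr_{S \sim D^m} \left(
\mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim P^k} e^{m' \cdot \mathcal{D}\left( \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}),\, \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}) \right)}
\le \frac{1}{\delta} \mathop{\mathbf{E}}_{S \sim D^m} \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim P^k} e^{m' \cdot \mathcal{D}\left( \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}),\, \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}) \right)}
\right) \ge 1 - \delta .
\]

Hence, by taking the logarithm on each side of the innermost inequality, we obtain

\[
\Pr_{S \sim D^m} \left(
\ln\left[ \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim P^k} e^{m' \cdot \mathcal{D}\left( \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}),\, \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}) \right)} \right]
\le \ln\left[ \frac{1}{\delta} \mathop{\mathbf{E}}_{S \sim D^m} \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim P^k} e^{m' \cdot \mathcal{D}\left( \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}),\, \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}) \right)} \right]
\right) \ge 1 - \delta .
\]

Now, instead of using the change of measure inequality of Lemma 17, we use the change
of measure inequality of Theorem 55 on the left side of the innermost inequality, with $\phi(f_{\mathbf{k}}) =
m' \cdot \mathcal{D}\big( \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}), \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}) \big)$. We then use Jensen’s inequality (Lemma 47, in Appendix A),
exploiting the convexity of $\mathcal{D}$.

\begin{align*}
\forall Q \text{ aligned on } P : \quad
\ln\left[ \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim P^k} e^{m' \cdot \mathcal{D}\left( \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}),\, \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}) \right)} \right]
&\ge m' \cdot \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim Q^k} \mathcal{D}\big( \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}), \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}) \big) \\
&\ge m' \cdot \mathcal{D}\Big( \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim Q^k} \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}),\ \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim Q^k} \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}) \Big) .
\end{align*}

We therefore have

\[
\Pr_{S \sim D^m} \left(
\begin{array}{l}
\text{For all posteriors } Q \text{ aligned on } P : \\[2pt]
m' \cdot \mathcal{D}\Big( \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim Q^k} \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}),\ \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim Q^k} \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}) \Big)
\le \ln\left[ \dfrac{1}{\delta} \mathop{\mathbf{E}}_{S \sim D^m} \mathop{\mathbf{E}}_{f_{\mathbf{k}} \sim P^k} e^{m' \cdot \mathcal{D}\left( \mathbf{E}_S^{\mathcal{L}}(f_{\mathbf{k}}),\, \mathbf{E}_D^{\mathcal{L}}(f_{\mathbf{k}}) \right)} \right]
\end{array}
\right) \ge 1 - \delta .
\]

The result then follows from easy calculations. ■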